Abstract. We develop results for the use of Lasso and Post-Lasso methods to form first-stage
predictions and estimate optimal instruments in linear instrumental variables (IV) models with 
many instruments, p. Our results apply even when p is much larger than the sample size, 
n. We show that the IV estimator based on using Lasso or Post-Lasso in the first stage is 
root-n consistent and asymptotically normal when the first-stage is approximately sparse; i.e. 
when the conditional expectation of the endogenous variables given the instruments can be well- 
approximated by a relatively small set of variables whose identities may be unknown. We also 
show the estimator is semi-parametrically efficient when the structural error is homoscedastic. 
Notably our results allow for imperfect model selection, and do not rely upon the unrealistic 
"beta-min" conditions that are widely used to establish validity of inference following model 
selection. In simulation experiments, the Lasso-based IV estimator with a data-driven penalty 
performs well compared to recently advocated many-instrument-robust procedures. In an empir- 
ical example dealing with the effect of judicial eminent domain decisions on economic outcomes, 
the Lasso-based IV estimator outperforms an intuitive benchmark. 

Optimal instruments are conditional expectations. In developing the IV results, we estab- 
lish a series of new results for Lasso and Post-Lasso estimators of nonparametric conditional 
expectation functions which are of independent theoretical and practical interest. We construct 
a modification of Lasso designed to deal with non-Gaussian, heteroscedastic disturbances which 
uses a data-weighted $\ell_1$-penalty function. By innovatively using moderate deviation theory for
self-normalized sums, we provide convergence rates for the resulting Lasso and Post-Lasso estimators
that are as sharp as the corresponding rates in the homoscedastic Gaussian case under
the condition that $\log p = o(n^{1/3})$. We also provide a data-driven method for choosing the
penalty level that must be specified in obtaining Lasso and Post-Lasso estimates and establish 
its asymptotic validity under non-Gaussian, heteroscedastic disturbances. 
1. Introduction

Instrumental variables (IV) techniques are widely used in applied economic research. While 
these methods provide a useful tool for identifying structural effects of interest, their application 
often results in imprecise inference. One way to improve the precision of instrumental variables 
estimators is to use many instruments or to try to approximate the optimal instruments as in 
Amemiya (1974), Chamberlain (1987), and Newey (1990). Estimation of optimal instruments 
will generally be done nonparametrically and thus implicitly makes use of many constructed 
instruments such as polynomials. The promised improvement in efficiency is appealing, but IV 
estimators based on many instruments may have poor properties. See, for example, Bekker 
(1994), Chao and Swanson (2005), Hansen, Hausman, and Newey (2008), and Hausman, Newey, 
Woutersen, Chao, and Swanson (2009), which propose solutions for this problem based on
"many-instrument" asymptotics.[1]

In this paper, we contribute to the literature on IV estimation with many instruments by con- 
sidering the use of Lasso and Post-Lasso for estimating the first-stage regression of endogenous 
variables on the instruments. Lasso is a widely used method that acts both as an estimator of 
regression functions and as a model selection device. Lasso solves for regression coefficients by 
minimizing the sum of the usual least squares objective function and a penalty for model size 
through the sum of the absolute values of the coefficients. The resulting Lasso estimator selects 
instruments and estimates the first-stage regression coefficients via a shrinkage procedure. The 
Post-Lasso estimator discards the Lasso coefficient estimates and uses the data-dependent set 
of instruments selected by Lasso to refit the first stage regression via OLS to alleviate Lasso's 
shrinkage bias. For theoretical and simulation evidence regarding Lasso's performance, see Bai 
and Ng (2008, 2009a), Bickel, Ritov, and Tsybakov (2009), Bunea, Tsybakov, and Wegkamp 
(2006, 2007a, 2007b), Candes and Tao (2007), Huang, Horowitz, and Wei (2010), Knight (2008), 
Koltchinskii (2009), Lounici (2008), Lounici, Pontil, Tsybakov, and van de Geer (2010), Mein- 
shausen and Yu (2009), Rosenbaum and Tsybakov (2008), Tibshirani (1996), van de Geer (2008), 
Wainwright (2009), Zhang and Huang (2008), Belloni and Chernozhukov (2012), and Bühlmann
and van de Geer (2011) among many others. See Belloni and Chernozhukov (2012) for analogous 
results on Post-Lasso. 

[1] It is important to note that the precise definition of "many-instrument" is $p \propto n$ with $p < n$, where p is
the number of instruments and n is the sample size. The current paper allows for this case and also for "very
many-instrument" asymptotics where $p > n$.

Using Lasso-based methods to form first-stage predictions in IV estimation provides a practical
approach to obtaining the efficiency gains from using optimal instruments while dampening
the problems associated with many instruments. We show that Lasso-based procedures produce
first-stage predictions that provide good approximations to the optimal instruments even
when the number of available instruments is much larger than the sample size, provided that the
first stage is approximately sparse - that is, when there exists a relatively small set of important
instruments whose identities are unknown that well-approximate the conditional expectation 
of the endogenous variables given the instruments. Under approximate sparsity, estimating the 
first-stage relationship using Lasso-based procedures produces IV estimators that are root-n con- 
sistent and asymptotically normal. The IV estimator with Lasso-based first stage also achieves 
the semi-parametric efficiency bound under the additional condition that structural errors are 
homoscedastic. Our results allow imperfect model selection and do not impose "beta-min" con- 
ditions that restrict the minimum allowable magnitude of the coefficients on relevant regressors. 
We also provide a consistent asymptotic variance estimator. Thus, our results generalize the 
IV procedure of Newey (1990) and Hahn (2002) based on conventional series approximation of 
the optimal instruments. Our results also generalize Bickel, Ritov, and Tsybakov (2009) by 
providing inference and confidence sets for the second-stage IV estimator based on Lasso or 
Post-Lasso estimates of the first-stage predictions. To our knowledge, our result is the first 
to verify root-n consistency and asymptotic normality of an estimator for a low-dimensional 
structural parameter in a high-dimensional setting without imposing the very restrictive
"beta-min" condition.[2] Our results also remain valid in the presence of heteroscedasticity and thus
provide a useful complement to existing approaches in the many instrument literature which 
often rely on homoscedasticity and may be inconsistent in the presence of heteroscedasticity; see 
Hausman, Newey, Woutersen, Chao, and Swanson (2009) for a notable exception that allows for 
heteroscedasticity and gives additional discussion. 

Instrument selection procedures complement existing methods that are meant to be robust to
many instruments, but they are not a universal solution to the many-instruments problem.
The good performance of instrument selection procedures relies on approximate sparsity. Unlike
traditional IV methods, instrument selection procedures do not require the identity of the
"important" instruments to be known a priori, as the identity of these instruments will be estimated
from the data. This flexibility comes with the cost that instrument selection will tend not to work 
well when the first-stage is not approximately sparse. When approximate sparsity breaks down, 
instrument selection procedures may select too few or no instruments or may select too many 
instruments. Two scenarios where this failure is likely to occur are the weak-instrument case,
e.g. Staiger and Stock (1997), Andrews, Moreira, and Stock (2006), Andrews and Stock (2005),
Moreira (2003), Kleibergen (2002), and Kleibergen (2005); and the many-weak-instrument case,
e.g. Bekker (1994), Chao and Swanson (2005), Hansen, Hausman, and Newey (2008), and
Hausman, Newey, Woutersen, Chao, and Swanson (2009). We consider two modifications of
our basic procedure aimed at alleviating these concerns. In Section 4, we present a sup-score
testing procedure that is related to Anderson and Rubin (1949) and Staiger and Stock (1997)
but is better suited to cases with very many instruments; and we consider a split sample IV
estimator in Section 5 which combines instrument selection via Lasso with the sample-splitting
method of Angrist and Krueger (1995). While these two procedures are steps toward addressing
weak identification concerns with very many instruments, further exploration of the interplay
between weak-instrument or many-weak-instrument methods and variable selection would be an
interesting avenue for additional research.

[2] The "beta-min" condition requires the relevant coefficients in the regression to be separated from zero by a
factor that exceeds the potential estimation error. This condition implies the identities of the relevant regressors
may be perfectly determined. There is a large body of theoretical work that uses such a condition and thus
implicitly assumes that the resulting post-model selection estimator is the same as the oracle estimator that
knows the identities of the relevant regressors. See Bühlmann and van de Geer (2011) for the discussion of the
"beta-min" condition and the theoretical role it plays in obtaining "oracle" results.

Our paper also contributes to the growing literature on Lasso-based methods by providing 
results for Lasso-based estimators of nonparametric conditional expectations. We consider a 
modified Lasso estimator with penalty weights designed to deal with non-Gaussianity and het- 
eroscedastic errors. This new construction allows us to innovatively use the results of moderate 
deviation theory for self-normalized sums of Jing, Shao, and Wang (2003) to provide conver- 
gence rates for Lasso and Post-Lasso. The derived convergence rates are as sharp as in the 
homoscedastic Gaussian case under the weak condition that the log of the number of regressors
p is small relative to $n^{1/3}$, i.e. $\log p = o(n^{1/3})$. Our construction generalizes the standard Lasso
estimator of Tibshirani (1996) and allows us to generalize the Lasso results of Bickel, Ritov, 
and Tsybakov (2009) and Post-Lasso results of Belloni and Chernozhukov (2012) both of which 
assume homoscedasticity and Gaussianity. The construction as well as theoretical results are 
important for applied economic analysis where researchers are concerned about heteroscedastic- 
ity and non-Gaussianity in their data. We also provide a data-driven method for choosing the 
penalty that must be specified to obtain Lasso and Post-Lasso estimates, and we establish its 
asymptotic validity allowing for non-Gaussian, heteroscedastic disturbances. Ours is the first
paper to provide such a data-driven penalty, which was previously not available even in the
Gaussian case.[3] These results are of independent interest in a variety of theoretical and applied
settings. 

We illustrate the performance of Lasso-based IV through simulation experiments. In these 
experiments, we find that a feasible Lasso-based procedure that uses our data-driven penalty 
performs well across a range of simulation designs where sparsity is a reasonable approximation. 
In terms of estimation risk, it outperforms the estimator of Fuller (1977) (FULL),[4] which is
robust to many instruments (e.g. Hansen, Hausman, and Newey, 2008), except in a design 
where sparsity breaks down and the sample size is large relative to the number of instruments. 
In terms of size of 5% level tests, the Lasso-based IV estimator performs comparably to or better
than FULL in all cases we consider. Overall, the simulation results are in line with the theory
and favorable to the proposed Lasso-based IV procedures.

[3] One exception is the work of Belloni, Chernozhukov, and Wang (2011b), which considers square-root-Lasso
estimators and shows that their use allows for pivotal penalty choices. Those results strongly rely on homoscedasticity.

[4] Note that this procedure is only applicable when the number of instruments p is less than the sample size n.
As mentioned earlier, procedures developed in this paper allow for p to be much larger than n.

Finally, we demonstrate the potential gains of the Lasso-based procedure in an application 
where there are many available instruments among which there is not a clear a priori way to 
decide which instruments to use. We look at the effect of judicial decisions at the federal circuit 
court level regarding the government's exercise of eminent domain on house prices and state- 
level GDP as in Chen and Yeh (2010). We follow the identification strategy of Chen and Yeh 
(2010), who use the random assignment of judges to three-judge panels that are then assigned
to eminent domain cases to justify using the demographic characteristics of the judges on the 
realized panels as instruments for their decision. This strategy produces a situation in which 
there are many potential instruments in that all possible sets of characteristics of the three-judge
panel are valid instruments. We find that the Lasso-based estimates using the data-dependent
penalty produce much larger first-stage Wald statistics and generally have smaller estimated
second-stage standard errors than estimates obtained using the baseline instruments of Chen
and Yeh (2010). 

Relationship to econometric literature on variable selection and shrinkage. The idea of
instrument selection goes back to Kloek and Mennes (1960) and Amemiya (1966), who
searched among principal components to approximate the optimal instruments. Related ideas 
appear in dynamic factor models as in Bai and Ng (2010), Kapetanios and Marcellino (2010), 
and Kapetanios, Khalaf, and Marcellino (2011). Factor analysis differs from our approach,
though principal components, factors, ridge fits, and other functions of the instruments could
be considered among the set of potential instruments to select from.[5]

There are several other papers that explore the use of modern variable selection methods in 
econometrics, including some papers that apply these procedures to IV estimation. Bai and Ng 
(2009b) consider an approach to instrument selection that is closely related to ours based on 
boosting. The latter method is distinct from Lasso, cf. Bühlmann (2006), but it also does not
rely on knowing the identity of the most important instruments. They show through simulation 
examples that instrument selection via boosting works well in the designs they consider but 
do not provide formal results. Bai and Ng (2009b) also expressly mention the idea of using 
the Lasso method for instrument selection, though they focus their analysis on the boosting 
method. Our paper complements their analysis by providing a formal set of conditions under 
which Lasso variable selection will provide good first-stage predictions and providing theoretical 
estimation and inference results for the resulting IV estimator. One of our theoretical results 
for the IV estimator is also sufficiently general to cover the use of any other first-stage variable 
selection procedure, including boosting, that satisfies a set of provided rate conditions. Caner
(2009) considers estimation by penalizing the GMM criterion function by the $\ell_\gamma$-norm of the
coefficients for $0 < \gamma < 1$. The analysis of Caner (2009) assumes that the number of parameters
p is fixed in relation to the sample size, and so it is complementary to our approach where
we allow $p \to \infty$ as $n \to \infty$. Other uses of Lasso in econometrics include Bai and Ng (2008),
Belloni, Chernozhukov, and Hansen (2011b), Brodie, Daubechies, Mol, Giannone, and Loris
(2009), DeMiguel, Garlappi, Nogales, and Uppal (2009), Huang, Horowitz, and Wei (2010),
Knight (2008), and others. An introductory treatment of this topic is given in Belloni and
Chernozhukov (2011b), and Belloni, Chernozhukov, and Hansen (2011a) provides a review of
Lasso targeted at economic applications.

[5] Approximate sparsity should be understood to be relative to a given structure defined by the set of instruments
considered. Allowing for principal components or ridge fits among the potential regressors considerably expands
the applicability of the approximately sparse framework.

Our paper is also related to other shrinkage-based approaches to dealing with many instru- 
ments. Chamberlain and Imbens (2004) considers IV estimation with many instruments using 
a shrinkage estimator based on putting a random coefficients structure over the first-stage co- 
efficients in a homoscedastic setting. In a related approach, Okui (2010) considers the use of 
ridge regression for estimating the first-stage regression in a homoscedastic framework where 
the instruments may be ordered in terms of relevance. Okui (2010) derives the asymptotic dis- 
tribution of the resulting IV estimator and provides a method for choosing the ridge regression 
smoothing parameter that minimizes the higher-order asymptotic mean-squared-error (MSE) of 
the IV estimator. These two approaches are related to the approach we pursue in this paper in 
that both use shrinkage in estimating the first-stage but differ in the shrinkage methods they 
use. Their results are also only supplied in the context of homoscedastic models. Donald and 
Newey (2001) consider a variable selection procedure that minimizes higher-order asymptotic 
MSE which relies on a priori knowledge that allows one to order the instruments in terms of 
instrument strength. Our use of Lasso as a variable selection technique does not require any a 
priori knowledge about the identity of the most relevant instruments and so provides a useful 
complement to Donald and Newey (2001) and Okui (2010). Carrasco (2012) provides an in- 
teresting approach to IV estimation with many instruments based on directly regularizing the 
inverse that appears in the definition of the 2SLS estimator; see also Carrasco and Tchuente 
Nguembu (2012). Carrasco (2012) considers three regularization schemes, including Tikhonov
regularization which corresponds to ridge regression, and shows that the regularized estimators
achieve the semi-parametric efficiency bound under some conditions. Carrasco (2012)'s approach
implicitly uses $\ell_2$-norm penalization and hence differs from and complements our approach. A
valuable feature of Carrasco (2012) is the provision of a data-dependent method for choosing
the regularization parameter based on minimizing higher-order asymptotic MSE following
Donald and Newey (2001) and Okui (2010). Finally, in work that is more recent than the present
paper, Gautier and Tsybakov (2011) consider the important case where the structural equation 
in an instrumental variables model is itself very high-dimensional and propose a new estima- 
tion method related to the Dantzig selector and the square-root-Lasso. They also provide an 
interesting inference method which differs from the one we consider. 
Notation. In what follows, we work with triangular array data $\{(z_{i,n}, i = 1, \ldots, n), n = 1, 2, 3, \ldots\}$
defined on some common probability space $(\Omega, \mathcal{A}, \mathrm{P})$. Each $z_{i,n} = (y_{i,n}', x_{i,n}', d_{i,n}')'$
is a vector, with components defined below, and these vectors are i.n.i.d., that is,
independent across i but not necessarily identically distributed. The law $\mathrm{P}_n$ of $\{z_{i,n}, i = 1, \ldots, n\}$
can change with n, though we do not make explicit use of $\mathrm{P}_n$. Thus, all parameters that
characterize the distribution of $\{z_{i,n}, i = 1, \ldots, n\}$ are implicitly indexed by the sample size
n, but we omit the index n in what follows to simplify notation. We use triangular array
asymptotics to better capture some finite-sample phenomena and to retain the robustness of
conclusions to perturbations of the data-generating process. We also use the following empirical
process notation, $\mathbb{E}_n[f] := \mathbb{E}_n[f(z_i)] := \sum_{i=1}^n f(z_i)/n$, and
$\mathbb{G}_n(f) := \sum_{i=1}^n (f(z_i) - \mathrm{E}[f(z_i)])/\sqrt{n}$.
Since we want to deal with i.n.i.d. data, we also introduce the average expectation operator:
$\bar{\mathrm{E}}[f] := \mathrm{E}\mathbb{E}_n[f] = \mathrm{E}\mathbb{E}_n[f(z_i)] = \sum_{i=1}^n \mathrm{E}[f(z_i)]/n$.
The $\ell_2$-norm is denoted by $\|\cdot\|_2$, and the
$\ell_0$-norm, $\|\cdot\|_0$, denotes the number of non-zero components of a vector. We use $\|\cdot\|_\infty$ to denote
the maximal element of a vector in absolute value. The empirical $L^2(\mathrm{P}_n)$ norm of a random variable $W_i$ is defined
as $\|W_i\|_{2,n} := \sqrt{\mathbb{E}_n[W_i^2]}$. When the empirical $L^2(\mathrm{P}_n)$ norm is applied to regressors $f_1, \ldots, f_p$ and
a vector $\delta \in \mathbb{R}^p$, $\|f_i'\delta\|_{2,n}$, it is called the prediction norm. Given a vector $\delta \in \mathbb{R}^p$ and a set of
indices $T \subset \{1, \ldots, p\}$, we denote by $\delta_T$ the vector in which $\delta_{Tj} = \delta_j$ if $j \in T$, $\delta_{Tj} = 0$ if $j \notin T$.
We also denote $T^c := \{1, 2, \ldots, p\} \setminus T$. We use the notation $(a)_+ = \max\{a, 0\}$, $a \vee b = \max\{a, b\}$
and $a \wedge b = \min\{a, b\}$. We also use the notation $a \lesssim b$ to denote $a \le cb$ for some constant $c > 0$
that does not depend on n; and $a \lesssim_P b$ to denote $a = O_P(b)$. For an event E, we say that E wp
$\to 1$ when E occurs with probability approaching one as n grows. We say $X_n =_d Y_n + o_P(1)$ to
mean that $X_n$ has the same distribution as $Y_n$ up to a term $o_P(1)$ that vanishes in probability.
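
For concreteness, the vector notation above can be written out in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, vectors are plain lists, the rows of `F` play the role of the regressor vectors $f_i$, and indices are 0-based rather than 1-based as in the text.

```python
import math

def l0_norm(delta):
    """Number of non-zero components of delta."""
    return sum(1 for d in delta if d != 0)

def sup_norm(delta):
    """Maximal element of delta in absolute value."""
    return max(abs(d) for d in delta)

def restrict(delta, T):
    """delta_T: keep delta_j for j in T and set every other component to zero."""
    return [d if j in T else 0.0 for j, d in enumerate(delta)]

def empirical_mean(values):
    """E_n[f] = sum_i f(z_i) / n for a list of already evaluated f(z_i)."""
    return sum(values) / len(values)

def prediction_norm(F, delta):
    """Empirical L2 norm of the fitted values f_i' delta."""
    fitted = [sum(f * d for f, d in zip(row, delta)) for row in F]
    return math.sqrt(empirical_mean([v * v for v in fitted]))

# Illustrative numbers only (not from the paper).
F_example = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0]]
delta_example = [3.0, 0.0, -1.0]
```

With these numbers the fitted values are 1 and -2, so the prediction norm equals the square root of 2.5.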

2. Sparse Models and Methods for Optimal Instrumental Variables 

In this section of the paper, we present the model and provide an overview of the main results. 
Sections 3 and 4 provide a technical presentation that includes a set of sufficient regularity 
conditions, discusses their plausibility, and establishes the main formal results of the paper. 

2.1. The IV Model and Statement of the Problem. The model is $y_i = d_i'\alpha_0 + \epsilon_i$, where
$\alpha_0$ denotes the true value of a vector-valued parameter $\alpha$. $y_i$ is the response variable, and
$d_i$ is a finite $k_d$-vector of variables whose first $k_e$ elements contain endogenous variables. The
disturbance $\epsilon_i$ obeys for all i (and n):

\[ \mathrm{E}[\epsilon_i \mid x_i] = 0, \]

where $x_i$ is a $k_x$-vector of instrumental variables.

As a motivation, suppose that the structural disturbance is conditionally homoscedastic, 
namely, for all i, $\mathrm{E}[\epsilon_i^2 \mid x_i] = \sigma^2$. Given a $k_d$-vector of instruments $A(x_i)$, the standard IV estimator
of $\alpha_0$ is given by $\hat\alpha = (\mathbb{E}_n[A(x_i) d_i'])^{-1}\,\mathbb{E}_n[A(x_i) y_i]$, where $\{(x_i, d_i, y_i), i = 1, \ldots, n\}$ is an i.i.d.
sample from the IV model above. For a given $A(x_i)$, $\sqrt{n}(\hat\alpha - \alpha_0) =_d N(0, Q_0^{-1}\Omega_0 {Q_0^{-1}}') + o_P(1)$,
where $Q_0 = \bar{\mathrm{E}}[A(x_i) d_i']$ and $\Omega_0 = \sigma^2 \bar{\mathrm{E}}[A(x_i) A(x_i)']$ under standard conditions. Setting $A(x_i) =
D(x_i) = \mathrm{E}[d_i \mid x_i]$ minimizes the asymptotic variance, which becomes

\[ \Lambda^* = \sigma^2 \{\bar{\mathrm{E}}[D(x_i) D(x_i)']\}^{-1}, \]

the semi-parametric efficiency bound for estimating ao; see Amemiya (1974), Chamberlain 
(1987), and Newey (1990). In practice, the optimal instrument D(xi) is an unknown func- 
tion and has to be estimated. In what follows, we investigate the use of sparse methods - 
namely Lasso and Post-Lasso - for use in estimating the optimal instruments. The resulting IV 
estimator is asymptotically as efficient as the infeasible optimal IV estimator above. 
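
The IV formula above can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes a single endogenous regressor (so the matrix inverse reduces to a division), a scalar instrument, and a hypothetical data-generating process in which $\mathrm{E}[d_i \mid x_i] = x_i$, so that $A(x_i) = x_i$ is the optimal instrument. The names `iv_estimate` and `simulate` are made up for the illustration.

```python
import random

def iv_estimate(A, d, y):
    """Scalar IV estimator: E_n[A(x_i) d_i]^{-1} E_n[A(x_i) y_i]."""
    n = len(y)
    numerator = sum(a * yi for a, yi in zip(A, y)) / n
    denominator = sum(a * di for a, di in zip(A, d)) / n
    return numerator / denominator

def simulate(n, alpha0=1.0, seed=0):
    """Hypothetical design: d_i = x_i + v_i, with the structural error correlated with v_i."""
    rng = random.Random(seed)
    x, d, y = [], [], []
    for _ in range(n):
        xi = rng.gauss(0.0, 1.0)
        vi = rng.gauss(0.0, 1.0)
        ei = 0.8 * vi + 0.6 * rng.gauss(0.0, 1.0)  # endogeneity enters through v_i
        di = xi + vi
        x.append(xi)
        d.append(di)
        y.append(alpha0 * di + ei)
    return x, d, y

x, d, y = simulate(5000)
alpha_iv = iv_estimate(x, d, y)    # instrument A(x_i) = x_i = E[d_i | x_i]
alpha_ols = iv_estimate(d, d, y)   # OLS is the same formula with d_i used as its own instrument
```

In this design the OLS estimate is biased upward by roughly 0.4, while the IV estimate should lie close to the true value of 1.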

Note that if $d_i$ contains exogenous components $w_i$, then $d_i = (d_{i1}, \ldots, d_{ik_e}, w_i')'$, where the first
$k_e$ variables are endogenous. Since the rest of the components $w_i$ are exogenous, they appear
in $x_i = (w_i', \tilde{x}_i')'$. It follows that $D_i := D(x_i) := \mathrm{E}[d_i \mid x_i] = (\mathrm{E}[d_{i1} \mid x_i], \ldots, \mathrm{E}[d_{ik_e} \mid x_i], w_i')'$; i.e. the
estimator of $w_i$ is simply $w_i$. Therefore, we discuss estimation of the conditional expectation
functions:

\[ D_{il} := D_l(x_i) := \mathrm{E}[d_{il} \mid x_i], \quad l = 1, \ldots, k_e. \]

In what follows, we focus on the strong instruments case, which translates into the assumption
that $Q = \bar{\mathrm{E}}[D(x_i) D(x_i)']$ has eigenvalues bounded away from zero and from above. We also
present an inference procedure that remains valid in the absence of strong instruments, which is
related to Anderson and Rubin (1949) and Staiger and Stock (1997) but allows for $p \gg n$.

2.2. Sparse Models for Optimal Instruments and Other Conditional Expectations. 

Suppose there is a very large list of instruments,

\[ f_i := (f_{i1}, \ldots, f_{ip})' := (f_1(x_i), \ldots, f_p(x_i))', \]

to be used in estimation of conditional expectations $D_l(x_i)$, $l = 1, \ldots, k_e$, where the number of
instruments p is possibly much larger than the sample size n.

For example, high-dimensional instruments $f_i$ could arise as any combination of the following
two cases. First, the list of available instruments may simply be large, in which case $f_i = x_i$ as
in e.g. Amemiya (1974) and Bekker (1994). Second, the list $f_i$ could consist of a large number
of series terms with respect to some elementary regressor vector $x_i$; e.g., $f_i$ could be composed
of B-splines, dummies, polynomials, and various interactions as in Newey (1990) or Hahn (2002)
among others. We term the first example the many instrument case and the second example the 
many series instrument case and note that our formulation does not require us to distinguish 
between the two cases. We mainly use the term "series instruments" and contrast our results 
with those in the seminal work of Newey (1990) and Hahn (2002), though our results are not 
limited to canonical series regressors as in Newey (1990) and Hahn (2002). The most important 
feature of our approach is that by allowing p to be much larger than the sample size, we are able 
to consider many more series instruments than in Newey (1990) and Hahn (2002) to approximate
the optimal instruments. 

The key assumption that allows effective use of this large set of instruments is sparsity. To 
fix ideas, consider the case where $D_l(x_i)$ is a function of only $s \ll n$ instruments:

\[
\begin{aligned}
D_l(x_i) &= f_i'\beta_{l0}, \quad l = 1, \ldots, k_e, \\
\max_{1 \le l \le k_e} \|\beta_{l0}\|_0 &= \max_{1 \le l \le k_e} \sum_{j=1}^p 1\{\beta_{l0j} \ne 0\} \le s \ll n.
\end{aligned}
\tag{2.1}
\]

This simple sparsity model generalizes the classic parametric model of optimal instruments of
Amemiya (1974) by letting the identities of the relevant instruments
$T_l = \mathrm{support}(\beta_{l0}) = \{j \in \{1, \ldots, p\} : |\beta_{l0j}| > 0\}$ be unknown.



The model given by (2.1) is unrealistic in that it presumes exact sparsity. We make no formal
use of this model, but instead use a much more general approximately sparse or nonparametric 
model: 

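Condition AS. (Approximately Sparse Optimal Instrument). Each optimal instrument
function $D_l(x_i)$ is well-approximated by a function of unknown $s \ge 1$ instruments:

\[
\begin{aligned}
& D_l(x_i) = f_i'\beta_{l0} + a_l(x_i), \quad l = 1, \ldots, k_e, \quad k_e \text{ fixed}, \\
& \max_{1 \le l \le k_e} \|\beta_{l0}\|_0 \le s = o(n), \qquad
\max_{1 \le l \le k_e} \left[\mathbb{E}_n a_l(x_i)^2\right]^{1/2} \le c_s \lesssim_P \sqrt{s/n}.
\end{aligned}
\]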

Condition AS is the key assumption. It requires that there are at most s terms for each
endogenous variable that are able to approximate the conditional expectation function $D_l(x_i)$
up to approximation error $a_l(x_i)$ chosen to be no larger than the conjectured size $\sqrt{s/n}$ of the
estimation error of the infeasible estimator that knows the identity of these important variables,
the "oracle estimator." In other words, the number s is defined so that the approximation error
is of the same order as the estimation error, $\sqrt{s/n}$, of the oracle estimator. Importantly, the
assumption allows the identity

\[ T_l = \mathrm{support}(\beta_{l0}) \]

to be unknown and to differ for $l = 1, \ldots, k_e$.
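
To see how the balance between approximation error and oracle estimation error pins down s, consider the following sketch. It rests on a simplifying assumption that is not part of Condition AS: the instruments are taken to be orthonormal in population, so the error from keeping only the s largest coefficients equals the square root of the sum of the squared omitted coefficients. The function name `sparsity_index` and the coefficient sequence are hypothetical.

```python
import math

def sparsity_index(beta, n):
    """Smallest s whose best s-term approximation error is at most sqrt(s / n).

    Assumes orthonormal instruments (an assumption of this sketch), so the
    approximation error is the Euclidean norm of the omitted coefficients.
    Returns the pair (s, approximation error at s).
    """
    squares = sorted((b * b for b in beta), reverse=True)
    tail = sum(squares)
    for s in range(1, len(squares) + 1):
        tail -= squares[s - 1]             # drop the s-th largest squared coefficient
        error = math.sqrt(max(tail, 0.0))  # guard against tiny negative rounding error
        if error <= math.sqrt(s / n):
            return s, error
    return len(squares), 0.0

# Coefficients decaying like j^(-2): no coefficient is exactly zero.
beta_example = [j ** -2.0 for j in range(1, 501)]
s_example, error_example = sparsity_index(beta_example, n=200)
```

In this example all 500 coefficients are non-zero, yet three terms already bring the approximation error (about 0.086) below $\sqrt{3/200} \approx 0.122$, which is the sense in which the first stage is approximately rather than exactly sparse.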

For a detailed motivation and discussion of this assumption, we refer the reader to Belloni, 
Chernozhukov, and Hansen (2011a). Condition AS generalizes the conventional series approximation
of optimal instruments in Newey (1990, 1997) and Hahn (2002) by letting the identities of
the most important s series terms $T_l$ be unknown. The rate $\sqrt{s/n}$ generalizes the rate obtained
with the optimal number s of series terms in Newey (1990) for estimating conditional expectations
by not relying on knowledge of what s series terms to include. Knowing the identities
of the most important series terms is unrealistic in many examples. The most important series
terms need not be the first s terms, and the optimal number of series terms to consider is also
unknown. Moreover, an optimal approximation could come from the combination of completely
different bases, e.g. by using both polynomials and B-splines.
Lasso and Post-Lasso use the data to estimate the set of the most relevant series terms in
a manner that allows the resulting IV estimator to achieve good performance if a key growth
condition,

\[ \frac{s^2 \log^2(p \vee n)}{n} \to 0, \]

holds along with other more technical conditions. The growth condition requires the optimal
instruments to be sufficiently smooth so that a small (relative to n) number of series terms can
be used to approximate them well. The use of a small set of instruments ensures that the impact
of first-stage estimation on the IV estimator is asymptotically negligible. We can weaken this
condition to $s \log(p \vee n) = o(n)$ by using the sample-splitting idea from the many instruments
literature.
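
The two growth conditions are asymptotic statements, but evaluating the corresponding ratios for given (s, p, n) gives a rough feel for how demanding they are. The helper names below are hypothetical, and the sketch says nothing about how small the ratios must be in a finite sample.

```python
import math

def growth_ratio(s, p, n):
    """s^2 * log^2(max(p, n)) / n, the quantity required to vanish."""
    return s ** 2 * math.log(max(p, n)) ** 2 / n

def split_sample_ratio(s, p, n):
    """s * log(max(p, n)) / n, the quantity in the weaker sample-splitting condition."""
    return s * math.log(max(p, n)) / n

# Illustrative values: s = 5 relevant instruments among p = 1000 candidates.
ratio_small_n = growth_ratio(5, 1000, 1000)
ratio_large_n = growth_ratio(5, 1000, 10 ** 6)
```

Since the first ratio equals n times the square of the second, the sample-splitting condition is satisfied along any sequence on which the original condition holds.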

2.3. Lasso-Based Estimation Methods for Optimal Instruments and Other Condi- 
tional Expectation Functions. Let us write the first-stage regression equations as 

\[ d_{il} = D_l(x_i) + v_{il}, \quad \mathrm{E}[v_{il} \mid x_i] = 0, \quad l = 1, \ldots, k_e. \tag{2.3} \]

Given the sample $\{(x_i, d_{il}, l = 1, \ldots, k_e), i = 1, \ldots, n\}$, we consider estimators of the optimal
instrument $D_{il} := D_l(x_i)$ that take the form

\[ \hat D_{il} := \hat D_l(x_i) = f_i'\hat\beta_l, \quad l = 1, \ldots, k_e, \]

where $\hat\beta_l$ is the Lasso or Post-Lasso estimator obtained by using $d_{il}$ as the dependent variable
and $f_i$ as regressors.

Consider the usual least squares criterion function: 

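\[ \hat Q_l(\beta) := \mathbb{E}_n[(d_{il} - f_i'\beta)^2]. \]

The Lasso estimator is defined as a solution of the following optimization program:

\[ \hat\beta_{lL} \in \arg\min_{\beta \in \mathbb{R}^p} \hat Q_l(\beta) + \frac{\lambda}{n} \|\hat\Upsilon_l \beta\|_1, \tag{2.4} \]

where $\lambda$ is the penalty level and $\hat\Upsilon_l = \mathrm{diag}(\hat\gamma_{l1}, \ldots, \hat\gamma_{lp})$ is a diagonal matrix specifying penalty
loadings.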
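
One way to compute a solution of a program of the form (2.4) is cyclic coordinate descent with soft-thresholding. The sketch below is a generic solver written for illustration, not the algorithm used in the paper; the names `soft_threshold` and `weighted_lasso` are hypothetical, and `loadings[j]` stands for the j-th diagonal entry of the loading matrix. Minimizing the objective in one coordinate, holding the others fixed, gives a soft-thresholded update with threshold `lam * loadings[j] / (2 * n)`.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def weighted_lasso(F, d, lam, loadings, max_sweeps=500, tol=1e-10):
    """Minimize E_n[(d_i - f_i'b)^2] + (lam / n) * sum_j loadings[j] * |b_j|."""
    n, p = len(F), len(F[0])
    beta = [0.0] * p
    resid = list(d)  # residual d - F beta at the starting value beta = 0
    col_ms = [sum(F[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(max_sweeps):
        biggest_change = 0.0
        for j in range(p):
            if col_ms[j] == 0.0:
                continue  # an all-zero column carries no information
            # E_n[f_ij * partial residual]; the partial residual adds back column j
            rho = sum(F[i][j] * resid[i] for i in range(n)) / n + col_ms[j] * beta[j]
            new = soft_threshold(rho, lam * loadings[j] / (2.0 * n)) / col_ms[j]
            change = new - beta[j]
            if change != 0.0:
                for i in range(n):
                    resid[i] -= F[i][j] * change
                beta[j] = new
                biggest_change = max(biggest_change, abs(change))
        if biggest_change < tol:
            break
    return beta

# Tiny orthogonal design (illustrative data, not from the paper).
F_demo = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
d_demo = [2.1, 1.9, -1.9, -2.1]  # equals 2.0 * column 0 + 0.1 * column 1
beta_demo = weighted_lasso(F_demo, d_demo, lam=2.0, loadings=[1.0, 1.0])
```

In this orthogonal example the threshold is 0.25, so the first coefficient is shrunk from 2 to about 1.75 and the second is set exactly to zero. Lowering the second loading to 0.2 reduces its threshold to 0.05 and lets that coefficient enter at about 0.05, which shows how the loadings control selection coordinate by coordinate.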

Our analysis will first employ the following "ideal" penalty loadings: 

\[ \hat\Upsilon_l^0 = \mathrm{diag}(\gamma_{l1}^0, \ldots, \gamma_{lp}^0), \quad \gamma_{lj}^0 = \sqrt{\mathbb{E}_n[f_{ij}^2 v_{il}^2]}, \quad j = 1, \ldots, p. \]

The ideal option leads to rather sharp theoretical bounds on estimation risk, but it is not
feasible since $v_{il}$ is not observed. In practice, we estimate the ideal loadings by
first using conservative penalty loadings and then plugging in the resulting estimated residuals
in place of $v_{il}$ to obtain the refined loadings. This procedure could be iterated via Algorithm
A.1 stated in the appendix.
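
The loading formula is straightforward to evaluate once residuals are available. The sketch below is an illustration under stated assumptions rather than a transcription of Algorithm A.1: `penalty_loadings` evaluates the formula for any residual vector, and `initial_loadings` uses the demeaned dependent variable as a conservative stand-in for the unobserved first-stage errors, which is an assumption of this sketch. Both function names are hypothetical.

```python
import math

def penalty_loadings(F, resid):
    """gamma_j = sqrt(E_n[f_ij^2 * resid_i^2]) for each column j of F."""
    n, p = len(F), len(F[0])
    return [
        math.sqrt(sum((F[i][j] * resid[i]) ** 2 for i in range(n)) / n)
        for j in range(p)
    ]

def initial_loadings(F, d):
    """Conservative start (assumed here): replace the error by d_i minus its sample mean."""
    mean_d = sum(d) / len(d)
    return penalty_loadings(F, [di - mean_d for di in d])

# Illustrative numbers only.
F_small = [[1.0, 2.0], [3.0, 4.0]]
loadings_small = penalty_loadings(F_small, [1.0, -1.0])
```

A refinement step would then fit Lasso with the initial loadings, form the residuals of the dependent variable from the fitted values, and call `penalty_loadings` again with those residuals.
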
The idea behind the ideal penalty loading is to introduce self-normalization of the first-order
condition of the Lasso problem by using data-dependent penalty loadings. This self-normalization
allows us to apply moderate deviation theory of Jing, Shao, and Wang (2003) for
self-normalized sums to bound deviations of the maximal element of the score vector

\[ S_l = 2\,\mathbb{E}_n[(\hat\Upsilon_l^0)^{-1} f_i v_{il}], \]

which provides a representation of the estimation noise in the problem. Specifically, the use of 
self-normalized moderate deviation theory allows us to establish that 

\[ \mathrm{P}\Big( \sqrt{n} \max_{1 \le l \le k_e} \|S_l\|_\infty \le 2\Phi^{-1}(1 - \gamma/(2 k_e p)) \Big) \ge 1 - \gamma + o(1), \tag{2.5} \]

from which we obtain sharp convergence results for the Lasso estimator under non-Gaussianity 
and heteroscedasticity. Without using these loadings, we may not be able to achieve the same 
sharp rate of convergence. It is important to emphasize that our construction of the penalty 
loadings for Lasso is new and differs from the canonical penalty loadings proposed in Tibshirani 
(1996) and Bickel, Ritov, and Tsybakov (2009). Finally, to insure the good performance of 
the Lasso estimator, one needs to select the penalty level X/n to dominate the noise for all k e 
regression problems simultaneously; i.e. the penalty level should satisfy 

X/n ^ c max \\Si\lJ) -»• 1, (2.6) 



for some constant c > 1. The bound (2.5) suggests that this can be achieved by selecting 

X = c2Vn$-\l--//(2k e p)), with 7^0, log(l/ 7 ) < log(pVn), (2.7) 



which implements (2.6). Our current recommendation is to set the confidence level 7 



0.1/ log (p V n) and the constant c = l.lj^] 
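
The penalty rule (2.7) with the recommended $\gamma$ and $c$ is a closed-form expression, so it can be sketched in a few lines. The function name `penalty_level` and its argument names are hypothetical; only the Python standard library is used.

```python
import math
from statistics import NormalDist

def penalty_level(n, p, k_e=1, c=1.1, gamma=None):
    """lambda = c * 2 * sqrt(n) * Phi^{-1}(1 - gamma / (2 * k_e * p)), as in (2.7).
    If gamma is not supplied, use the recommendation gamma = 0.1 / log(max(p, n))."""
    if gamma is None:
        gamma = 0.1 / math.log(max(p, n))
    quantile = NormalDist().inv_cdf(1.0 - gamma / (2.0 * k_e * p))
    return c * 2.0 * math.sqrt(n) * quantile

lam = penalty_level(n=100, p=200)
```

Because the normal quantile grows only like $\sqrt{\log(k_e p/\gamma)}$, the penalty level increases slowly with the number of instruments.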



The Post-Lasso estimator is defined as the ordinary least squares regression applied to the
model $\hat I_l \supseteq \hat T_l$, where $\hat T_l$ is the model selected by Lasso:

$$\hat T_l = \mathrm{support}(\hat\beta_{lL}) = \{j \in \{1, \ldots, p\} : |\hat\beta_{lLj}| > 0\}, \quad l = 1, \ldots, k_e.$$

The set $\hat I_l$ can contain additional variables not selected by Lasso, but we require the number of
such variables to be similar to or smaller than the number selected by Lasso. The Post-Lasso
estimator $\hat\beta_{lPL}$ is

$$\hat\beta_{lPL} \in \arg\min_{\beta \in \mathbb{R}^p :\, \beta_{\hat I_l^c} = 0} \hat Q_l(\beta), \quad l = 1, \ldots, k_e. \qquad (2.8)$$


In words, this estimator is ordinary least squares (OLS) using only the instruments/regressors 
whose coefficients were estimated to be non-zero by Lasso and any additional variables the 
researcher feels are important despite having Lasso coefficient estimates of zero. 
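
The verbal description above translates directly into a short routine: take the support of a Lasso fit, optionally add researcher-chosen columns, and run OLS on that subset. This is a minimal sketch; the name `post_lasso` and the `extra` argument are hypothetical, and the Lasso coefficients are taken as given from any solver.

```python
import numpy as np

def post_lasso(F, d, beta_lasso, extra=()):
    """OLS of d on the columns of F with non-zero Lasso coefficients,
    plus any additional column indices listed in `extra`."""
    p = F.shape[1]
    selected = {int(j) for j in np.flatnonzero(beta_lasso != 0.0)}
    selected |= {int(j) for j in extra}
    idx = np.array(sorted(selected), dtype=int)
    beta = np.zeros(p)                 # coefficients outside the model stay at zero
    if idx.size > 0:
        beta[idx] = np.linalg.lstsq(F[:, idx], d, rcond=None)[0]
    return beta, idx
```

The refit discards the shrunken Lasso coefficient values and keeps only the selected support, which is how the estimator removes part of the shrinkage bias.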



We note that there is not much room to change $c$. Theoretically, we require $c > 1$, and finite-sample
experiments show that increasing $c$ away from $c = 1$ worsens the performance. Hence a value slightly above unity,
namely $c = 1.1$, is our current recommendation. The simulation evidence suggests that setting $c$ to any value
near 1, including $c = 1$, does not impact the result noticeably.

Lasso and Post-Lasso are motivated by the desire to predict the target function well without
overfitting. Clearly, the OLS estimator is not consistent for estimating the target function
when $p > n$. Some approaches based on BIC-penalization of model size are consistent but
computationally infeasible. The Lasso estimator of Tibshirani (1996) resolves these difficulties
by penalizing model size through the sum of absolute parameter values. The Lasso estimator
is computationally attractive because it minimizes a convex function. Moreover, under suitable
conditions, this estimator achieves near-optimal rates in estimating the regression function
$D_l(x_i)$. The estimator achieves these rates by adapting to the unknown smoothness or sparsity
of $D_l(x_i)$. Nonetheless, the estimator has an important drawback: the regularization by the
$\ell_1$-norm employed in (2.4) naturally lets the Lasso estimator avoid overfitting the data, but it
also shrinks the estimated coefficients towards zero, causing a potentially significant bias. The
Post-Lasso estimator is meant to remove some of this shrinkage bias. If model selection by Lasso
works perfectly, that is, if it selects exactly the "relevant" instruments, then the resulting
Post-Lasso estimator is simply the standard OLS estimator using only the relevant variables.
In cases where perfect selection does not occur, Post-Lasso estimates of coefficients will still
tend to be less biased than Lasso. We prove the Post-Lasso estimator achieves the same rate of
convergence as Lasso, which is a near-optimal rate, despite imperfect model selection by Lasso.

The introduction of self-normalization via the penalty loadings allows us to contribute to the
broad Lasso literature cited in the introduction by showing that under possibly heteroscedastic
and non-Gaussian errors the Lasso and Post-Lasso estimators obey the following near-oracle
performance bounds:

$$\max_{1 \le l \le k_e}\|\hat D_{il} - D_{il}\|_{2,n} \lesssim_P \sqrt{\frac{s\log(p \vee n)}{n}} \quad\text{and}\quad \max_{1 \le l \le k_e}\|\hat\beta_l - \beta_{l0}\|_1 \lesssim_P \sqrt{\frac{s^2\log(p \vee n)}{n}}. \qquad (2.9)$$

The performance bounds in (2.9) are called near-oracle because they coincide up to a $\sqrt{\log p}$
factor with the bounds achievable when the ideal series terms $T_l$ for each of the $k_e$ regression
equations in (2.2) are known. Our results extend those of Bickel, Ritov, and Tsybakov (2009)
for Lasso with Gaussian errors and those of Belloni and Chernozhukov (2012) for Post-Lasso
with Gaussian errors. Notably, these bounds are as sharp as the results for the Gaussian case
under the weak condition $\log p = o(n^{1/3})$. They are also the first results in the literature that
allow for data-driven choice of the penalty level.



It is also useful to contrast the rates given in (2.9) with the rates available for nonparametrically
estimating conditional expectations in the series literature; see, for example, Newey (1997).
Obtaining rates of convergence for series estimators relies on approximate sparsity just as our
results do. Approximate sparsity in the series context is typically motivated by smoothness
assumptions, but approximate sparsity is more general than typical smoothness assumptions (see,
e.g., Belloni, Chernozhukov, and Hansen (2011a) and Belloni, Chernozhukov, and Hansen (2011b) for
detailed discussion of approximate sparsity). The standard series approach postulates that the
first $K$ series terms are the most important for approximating the target regression function
$D_l$. The Lasso approach postulates that $s$ terms from a large number $p$ of terms are important
but does not require knowledge of the identity of these terms or the number of terms, $s$, needed
to approximate the target function well enough that approximation errors are small relative to
estimation error. Lasso methods estimate both the optimal number of series terms $s$ and the
identities of these terms and thus automatically adapt to the unknown sparsity (or smoothness) of
the true optimal instrument (conditional expectation). This behavior differs sharply from
standard series procedures that do not adapt to the unknown sparsity of the target function unless
the number of series terms is chosen by a model selection method. Lasso-based methods may also
provide enhanced approximation of the optimal instrument by allowing selection of the most
important terms from among a set of very many series terms with total number of terms $p \gg K$
that can be much larger than the sample size. For example, for any $K < m$, a standard series
approach based on $K$ terms will perform poorly when the terms $m + 1, m + 2, \ldots, m + j$ are the
most important for approximating the optimal instrument. On the other hand, Lasso-based methods
will find the important terms as long as $p > m + j$, which is much less stringent than what is
required in usual series approaches since $p$ can be very large. This point can also be made using
array asymptotics where the model changes with $n$ in such a way that the important series terms
are always missed by the first $K \to \infty$ terms. Of course, the additional flexibility allowed
for by Lasso-based methods comes with a price, namely slowing the rate of convergence by
$\sqrt{\log p}$ relative to the usual series rates.

2.4. The Instrumental Variable Estimator based on Lasso and Post-Lasso constructed
Optimal Instrument. Given Condition AS, we take advantage of the approximate sparsity
by using Lasso and Post-Lasso methods to construct estimates of $D_l(x_i)$ of the form

$$\hat D_l(x_i) = f_i'\hat\beta_l, \quad l = 1, \ldots, k_e,$$

and then set

$$\hat D_i = (\hat D_1(x_i), \ldots, \hat D_{k_e}(x_i), w_i')'.$$

The resulting IV estimator takes the form

$$\hat\alpha = E_n[\hat D_i d_i']^{-1} E_n[\hat D_i y_i].$$

The main result of this paper is to show that, despite the possibility of $p$ being very large, Lasso
and Post-Lasso can select a set of instruments to produce estimates of the optimal instruments
$\hat D_i$ such that the resulting IV estimator achieves the efficiency bound asymptotically:

$$\sqrt{n}(\hat\alpha - \alpha_0) =_d N(0, \Lambda^*) + o_P(1).$$
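
The second-stage formula is a single linear solve once the estimated instruments are stacked in a matrix. The sketch below assumes `D_hat` holds the first-stage fits together with the controls, row by row, and `d` holds the endogenous variables together with the same controls; the function name `iv_estimate` is hypothetical.

```python
import numpy as np

def iv_estimate(D_hat, d, y):
    """alpha_hat = E_n[D_hat_i d_i']^{-1} E_n[D_hat_i y_i].
    D_hat: n x k estimated instruments, d: n x k regressors, y: length-n outcome."""
    n = len(y)
    return np.linalg.solve(D_hat.T @ d / n, D_hat.T @ y / n)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a numerical design choice; both compute the same estimator when the $k \times k$ matrix is well conditioned.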

We can allow for $p > n$ for series formed with orthonormal bases with bounded components, such as
trigonometric bases, but further restrictions on the number of terms apply if bounds on components of the
series are allowed to increase with the sample size. For example, if we work with B-spline series terms, we can
only consider $p = o(n)$ terms.

The estimator matches the performance of the classical/standard series-based IV estimator of
Newey (1990) and has the additional advantages mentioned in the previous subsection. We also show
that the IV estimator with Lasso-based optimal instruments continues to be root-n consistent
and asymptotically normal in the presence of heteroscedasticity:

$$\sqrt{n}(\hat\alpha - \alpha_0) =_d N(0, Q^{-1}\Omega Q^{-1}) + o_P(1), \qquad (2.10)$$

where $\Omega := \bar E[\epsilon_i^2 D(x_i)D(x_i)']$ and $Q := \bar E[D(x_i)D(x_i)']$. A consistent estimator for
the asymptotic variance is

$$\hat Q^{-1}\hat\Omega\hat Q^{-1}, \quad \hat\Omega := E_n[\hat\epsilon_i^2 \hat D(x_i)\hat D(x_i)'], \quad \hat Q := E_n[\hat D(x_i)\hat D(x_i)'], \qquad (2.11)$$

where $\hat\epsilon_i := y_i - d_i'\hat\alpha$, $i = 1, \ldots, n$. Using (2.11) we can perform robust inference.
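
The variance estimator (2.11) is a plug-in sandwich formula. The following minimal sketch computes it and the implied standard errors; the function names `iv_robust_variance` and `standard_errors` are hypothetical, and `D_hat`, `d`, `y` follow the same layout assumed for the IV estimator (one row per observation).

```python
import numpy as np

def iv_robust_variance(D_hat, d, y, alpha_hat):
    """Plug-in estimate Q^{-1} Omega Q^{-1} from (2.11), using residuals y - d alpha_hat."""
    n = len(y)
    eps = y - d @ alpha_hat
    Q = D_hat.T @ D_hat / n
    Omega = (D_hat * (eps ** 2)[:, None]).T @ D_hat / n
    Q_inv = np.linalg.inv(Q)
    return Q_inv @ Omega @ Q_inv

def standard_errors(V, n):
    """Standard errors of alpha_hat implied by sqrt(n)-asymptotics with variance V."""
    return np.sqrt(np.diag(V) / n)
```

When the squared residuals are all equal to a constant $\sigma^2$, the formula collapses to $\sigma^2 \hat Q^{-1}$, which matches the homoscedastic case discussed in the text.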



We note that our result (2.10) for the IV estimator does not rely on Lasso and Lasso-based
procedures specifically. We provide the properties of the IV estimator for any generic
sparsity-based procedure that achieves the near-oracle performance bounds (2.9).



We conclude by stressing that our result (2.10) does not rely on perfect model selection. Perfect
model selection only occurs in extremely limited circumstances that are unlikely to occur
in practice. We show that model selection mistakes do not affect the asymptotic distribution
of the IV estimator $\hat\alpha$ under mild regularity conditions. The intuition is that the model
selection mistakes are sufficiently small to allow Lasso or Post-Lasso to estimate the first-stage
predictions with sufficient, near-oracle accuracy, which translates to the result above. Using
analysis like that given in Belloni, Chernozhukov, and Hansen (2011b), the result (2.10) can be
shown to hold over models with strong optimal instruments which are uniformly approximately
sparse. We also offer an inference procedure in Section 4.2 that remains valid in the absence
of a strong optimal instrument, is robust to many weak instruments, and can be used even if
$p \gg n$. This procedure could also be shown to be uniformly valid over a large class of models.



3. Results on Lasso and Post-Lasso Estimation of Conditional Expectation 
Functions under Heteroscedastic, Non-Gaussian Errors 

In this section, we present our main results on Lasso and Post-Lasso estimators of conditional 
expectation functions under non-classical assumptions and data-driven penalty choices. The 
problem we analyze in this section has many applications outside the IV framework of the 
present paper. 

3.1. Regularity Conditions for Estimating Conditional Expectations. The key condition
concerns the behavior of the empirical Gram matrix $E_n[f_i f_i']$. This matrix is necessarily
singular when $p > n$, so in principle it is not well-behaved. However, we only need good behavior
of certain moduli of continuity of the Gram matrix. The first modulus of continuity is called the
restricted eigenvalue and is needed for Lasso. The second modulus is called the sparse eigenvalue
and is needed for Post-Lasso.

In order to define the restricted eigenvalue, first define the restricted set:

$$\Delta_{C,T} = \{\delta \in \mathbb{R}^p : \|\delta_{T^c}\|_1 \le C\|\delta_T\|_1, \ \delta \ne 0\}.$$

The restricted eigenvalue of a Gram matrix $M = E_n[f_i f_i']$ takes the form:

$$\kappa_C^2(M) := \min_{\delta \in \Delta_{C,T},\, |T| \le s} s\,\frac{\delta' M \delta}{\|\delta_T\|_1^2}. \qquad (3.12)$$



This restricted eigenvalue can depend on n, but we suppress the dependence in our notation. 

In making simplified asymptotic statements involving the Lasso estimator, we will invoke the
following condition:

Condition RE. For any $C > 0$, there exists a finite constant $\kappa > 0$, which does not depend
on $n$ but may depend on $C$, such that the restricted eigenvalue obeys $\kappa_C(E_n[f_i f_i']) \ge \kappa$ with
probability approaching one as $n \to \infty$.



The restricted eigenvalue (3.12) is a variant of the restricted eigenvalues introduced in Bickel,
Ritov, and Tsybakov (2009) to analyze the properties of Lasso in the classical Gaussian regression
model. Even though the minimal eigenvalue of the empirical Gram matrix $E_n[f_i f_i']$ is zero
whenever $p > n$, Bickel, Ritov, and Tsybakov (2009) show that its restricted eigenvalues can be
bounded away from zero. Lemmas 1 and 2 below contain sufficient conditions for this. Many
other sufficient conditions are available from the literature; see Bickel, Ritov, and Tsybakov
(2009). Consequently, we take restricted eigenvalues as primitive quantities and Condition RE
as a primitive condition.

Comment 3.1 (On Restricted Eigenvalues). In order to gain intuition about restricted eigenvalues,
assume the exactly sparse model, in which there is no approximation error. In this
model, the term $\delta$ stands for a generic deviation between an estimator and the true parameter
vector $\beta_0$. Thus, the restricted eigenvalue represents a modulus of continuity between a
penalty-related term and the prediction norm, which allows us to derive the rate of convergence.
Indeed, the restricted eigenvalue bounds the minimum change in the prediction norm induced
by a deviation $\delta$ within the restricted set $\Delta_{C,T}$ relative to the norm of $\delta_T$, the deviation on the
true support. Given a specific choice of the penalty level, the deviation of the estimator belongs
to the restricted set, making the restricted eigenvalue relevant for deriving rates of convergence.

In order to define the sparse eigenvalues, let us define the $m$-sparse subset of a unit sphere as

$$\Delta(m) = \{\delta \in \mathbb{R}^p : \|\delta\|_0 \le m, \ \|\delta\|_2 = 1\},$$

and also define the minimal and maximal $m$-sparse eigenvalues of the Gram matrix $M = E_n[f_i f_i']$
as

$$\phi_{\min}(m)(M) = \min_{\delta \in \Delta(m)} \delta' M \delta \quad\text{and}\quad \phi_{\max}(m)(M) = \max_{\delta \in \Delta(m)} \delta' M \delta.$$
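
For very small problems the sparse eigenvalues can be computed by brute force, which may help build intuition for why they can stay away from zero even when the full Gram matrix is singular. The sketch below enumerates all supports of size $m$ and takes the extreme eigenvalues of the corresponding principal submatrices; supports with fewer than $m$ elements are covered automatically by eigenvalue interlacing. The function name `sparse_eigenvalues` is hypothetical, and the cost grows combinatorially in $p$, so this is an illustration only.

```python
import itertools
import numpy as np

def sparse_eigenvalues(M, m):
    """Brute-force (phi_min(m)(M), phi_max(m)(M)) for a symmetric p x p matrix M,
    by scanning every m x m principal submatrix. Requires 1 <= m <= p."""
    p = M.shape[0]
    lo, hi = np.inf, -np.inf
    for support in itertools.combinations(range(p), m):
        eig = np.linalg.eigvalsh(M[np.ix_(support, support)])  # ascending order
        lo, hi = min(lo, eig[0]), max(hi, eig[-1])
    return lo, hi
```

For an empirical Gram matrix built from $n = 5$ observations and $p = 8$ regressors, the minimal 8-sparse eigenvalue is numerically zero, while the minimal 2-sparse eigenvalue is generically positive.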

To simplify asymptotic statements for Post-Lasso, we use the following condition:

Condition SE. For any $C > 0$, there exist constants $0 < \kappa' < \kappa'' < \infty$, which do not
depend on $n$ but may depend on $C$, such that with probability approaching one, as $n \to \infty$,

$$\kappa' \le \phi_{\min}(Cs)(E_n[f_i f_i']) \le \phi_{\max}(Cs)(E_n[f_i f_i']) \le \kappa''.$$

Condition SE requires only that certain "small" $m \times m$ submatrices of the large $p \times p$ empirical
Gram matrix are well-behaved, which is a reasonable assumption and will be sufficient for
the results that follow. Condition SE implies Condition RE by the argument given in Bickel,
Ritov, and Tsybakov (2009). The following lemmas show that Conditions RE and SE are
plausible for both many-instrument and many series-instrument settings. We refer to Belloni
and Chernozhukov (2012) for proofs; the first lemma builds upon results in Zhang and Huang
(2008) and the second builds upon results in Rudelson and Vershynin (2008). The lemmas could
also be derived from Rudelson and Zhou (2011).

Lemma 1 (Plausibility of RE and SE under Many Gaussian Instruments). Suppose $f_i$, $i = 1, \ldots, n$,
are i.i.d. zero-mean Gaussian random vectors. Further suppose that the population
Gram matrix $E[f_i f_i']$ has $s\log n$-sparse eigenvalues bounded from above and away from zero
uniformly in $n$. Then if $s\log n = o(n/\log p)$, Conditions RE and SE hold.

Lemma 2 (Plausibility of RE and SE under Many Series Instruments). Suppose $f_i$, $i = 1, \ldots, n$,
are i.i.d. bounded zero-mean random vectors with $\|f_i\|_\infty \le K_B$ a.s. Further suppose that the
population Gram matrix $E[f_i f_i']$ has $s\log n$-sparse eigenvalues bounded from above and away
from zero uniformly in $n$. Then if $K_B\, s\log^2(n)\log^2(s\log n)\log(p \vee n) = o(n)$, Conditions RE
and SE hold.

In the context of i.i.d. sampling, a standard assumption in econometric research is that the
population Gram matrix $\bar E[f_i f_i']$ has eigenvalues bounded from above and below; see, e.g., Newey
(1997). The lemmas above allow for this and more general behavior, requiring only that the
sparse eigenvalues of the population Gram matrix $\bar E[f_i f_i']$ are bounded from below and from
above. The latter is important for allowing functions $f_i$ to be formed as a combination of
elements from different bases, e.g. a combination of B-splines with polynomials. The lemmas
above further show that the good behavior of the population sparse eigenvalues translates into
good behavior of empirical sparse eigenvalues under some restrictions on the growth of $s$ in
relation to the sample size $n$. For example, if $p$ grows polynomially with $n$ and the components
of technical regressors are uniformly bounded, Lemma 2 holds provided $s = o(n/\log^5 n)$.

We also impose the following moment conditions on the reduced form errors $v_{il}$ and regressors
$f_i$, where we let $\tilde d_{il} := d_{il} - \bar E[d_{il}]$.

Condition RF. (i) $\max_{l \le k_e, j \le p} \bar E[\tilde d_{il}^2] + \bar E[|f_{ij}^2 \tilde d_{il}^2|] + 1/\bar E[f_{ij}^2 v_{il}^2] \lesssim 1$,
(ii) $\max_{l \le k_e, j \le p} \bar E[|f_{ij}^3 v_{il}^3|] \lesssim K_n$,
(iii) $K_n^2 \log^3(p \vee n) = o(n)$ and $s\log(p \vee n) = o(n)$,
(iv) $\max_{i \le n, j \le p} f_{ij}^2 [s\log(p \vee n)]/n \to_P 0$ and
$\max_{l \le k_e, j \le p} |(E_n - \bar E)[f_{ij}^2 v_{il}^2]| + |(E_n - \bar E)[f_{ij}^2 \tilde d_{il}^2]| \to_P 0$.

We emphasize that the conditions given above are only one possible set of sufficient conditions, 
which are presented in a manner that reduces the complexity of the exposition. 

The following lemma shows that the population and empirical moment conditions appearing
in Condition RF are plausible for both many-instrument and many series-instrument settings.
Note that we say that a random variable $g_i$ has uniformly bounded conditional moments of order
$K$ if for some positive constants $0 < B_1 < B_2 < \infty$:

$$B_1 \le E[|g_i|^k \mid x_i] \le B_2 \ \text{ with probability 1, for } k = 1, \ldots, K, \ i = 1, \ldots, n.$$



Lemma 3 (Plausibility of RF). 1. If the moments $\bar E[\tilde d_{il}^4]$ and $\bar E[v_{il}^4]$ are bounded uniformly in
$1 \le l \le k_e$ and in $n$, and the regressors $f_i$ obey $\max_{1 \le j \le p} E_n[f_{ij}^4] \lesssim_P 1$ and
$\max_{1 \le i \le n, 1 \le j \le p} f_{ij}^2\, s\log(p \vee n)/n \to_P 0$, then Conditions RF(i)-(iii) imply Condition RF(iv).
2. Suppose that $\{(f_i, \tilde d_i, v_i), i = 1, \ldots, n\}$ are i.i.d. vectors, and that $\tilde d_{il}$ and $v_{il}$ have
uniformly bounded conditional moments of order 4 uniformly in $l = 1, \ldots, k_e$. (1) If the regressors
$f_i$ are Gaussian as in Lemma 1, Condition RF(iii) holds, and $s\log^2(p \vee n)/n \to 0$, then
Conditions RF(i), (ii) and (iv) hold. (2) If the regressors $f_i$ have bounded entries as in Lemma 2,
then Conditions RF(i), (ii) and (iv) hold under Condition RF(iii).

3.2. Results on Lasso and Post-Lasso for Estimating Conditional Expectations. We
consider Lasso and Post-Lasso estimators defined in equations (2.4) and (2.8) in the system
of $k_e$ nonparametric regression equations (2.3) with non-Gaussian and heteroscedastic errors.
These results extend the previous results of Bickel, Ritov, and Tsybakov (2009) for Lasso and
of Belloni and Chernozhukov (2012) for Post-Lasso with classical i.i.d. errors. In addition, we
account for the fact that we are simultaneously estimating $k_e$ regressions and account for the
dependence of our results on $k_e$.

The following theorem presents the properties of Lasso. Let us call asymptotically valid any
penalty loadings that obey a.s.

$$\ell\,\hat\Upsilon_l^0 \le \hat\Upsilon_l \le u\,\hat\Upsilon_l^0, \qquad (3.13)$$

with $0 < \ell \le 1 \le u$ such that $\ell \to_P 1$ and $u \to_P u'$ with $u' \ge 1$. The penalty loadings constructed
by Algorithm A.1 satisfy this condition.

Theorem 1 (Rates for Lasso under Non-Gaussian and Heteroscedastic Errors). Suppose that in
the regression model (2.3) Conditions AS and RF hold. Suppose the penalty level is specified as
in (2.7), and consider any asymptotically valid penalty loadings $\hat\Upsilon$, for example, penalty loadings
constructed by Algorithm A.1 stated in Appendix A. Then, the Lasso estimator $\hat\beta_l = \hat\beta_{lL}$ and the
Lasso fit $\hat D_{il} = f_i'\hat\beta_{lL}$, $l = 1, \ldots, k_e$, satisfy

$$\max_{1 \le l \le k_e}\|\hat D_{il} - D_{il}\|_{2,n} \lesssim_P \frac{1}{\kappa_{\bar C}}\sqrt{\frac{s\log(k_e p/\gamma)}{n}} \quad\text{and}\quad \max_{1 \le l \le k_e}\|\hat\beta_l - \beta_{l0}\|_1 \lesssim_P \frac{1}{(\kappa_{2\bar C})^2}\sqrt{\frac{s^2\log(k_e p/\gamma)}{n}},$$

where $\bar C = \max_{1 \le l \le k_e}\{\|\hat\Upsilon_l^0\|_\infty \|(\hat\Upsilon_l^0)^{-1}\|_\infty\}(uc + 1)/(\ell c - 1)$ and $\kappa_{\bar C} = \kappa_{\bar C}(E_n[f_i f_i'])$.



The theorem provides a rate result for the Lasso estimator constructed specifically to deal
with non-Gaussian errors and heteroscedasticity. The rate result generalizes, and is as sharp as,
the rate results of Bickel, Ritov, and Tsybakov (2009) obtained for the homoscedastic Gaussian
case. This generalization is important for real applications where non-Gaussianity and
heteroscedasticity are ubiquitous. Note that the obtained rate is near-optimal in the sense that
if we happened to know the model $T_l$, i.e. if we knew the identities of the most important
variables, we would only improve the rate by the $\log p$ factor. The theorem also shows that the
data-driven penalty loadings defined in Algorithm A.1 are asymptotically valid.

The following theorem presents the properties of Post-Lasso, which requires a mild assumption
on the number of additional variables in the set $\hat I_l$, $l = 1, \ldots, k_e$. We assume that the sizes of
these sets are not substantially larger than the model selected by Lasso, namely, a.s.

$$|\hat I_l \setminus \hat T_l| \lesssim 1 \vee |\hat T_l|, \quad l = 1, \ldots, k_e. \qquad (3.14)$$

Theorem 2 (Rates for Post-Lasso under Non-Gaussian and Heteroscedastic Errors). Suppose
that in the regression model (2.3) Conditions AS and RF hold. Suppose the penalty level for the
Lasso estimator is specified as in (2.7), that Lasso's penalty loadings $\hat\Upsilon$ are asymptotically valid,
and the sets of additional variables obey (3.14). Then, the Post-Lasso estimator $\hat\beta_l = \hat\beta_{lPL}$ and
the Post-Lasso fit $\hat D_{il} = f_i'\hat\beta_{lPL}$, $l = 1, \ldots, k_e$, satisfy

$$\max_{1 \le l \le k_e}\|\hat D_{il} - D_{il}\|_{2,n} \lesssim_P \frac{\mu}{\kappa_{\bar C}}\sqrt{\frac{s\log(k_e p/\gamma)}{n}} \quad\text{and}\quad \max_{1 \le l \le k_e}\|\hat\beta_l - \beta_{l0}\|_1 \lesssim_P \frac{\mu^2}{(\kappa_{\bar C})^2}\sqrt{\frac{s^2\log(k_e p/\gamma)}{n}},$$

where $\mu^2 = \min_k\{\phi_{\max}(k)(E_n[f_i f_i'])/\phi_{\min}(k + s)(E_n[f_i f_i']) : k > 18\bar C^2 s\,\phi_{\max}(k)(E_n[f_i f_i'])/(\kappa_{\bar C})^2\}$
for $\bar C$ defined in Theorem 1.



The theorem provides a rate result for the Post-Lasso estimator with non-Gaussian errors and
heteroscedasticity. The rate result generalizes the results of Belloni and Chernozhukov (2012)
obtained for the homoscedastic Gaussian case. The Post-Lasso achieves the same near-optimal
rate of convergence as Lasso. As stressed in the introductory sections, our analysis allows Lasso
to make model selection mistakes, which are expected generically. We show that these model
selection mistakes are small enough to allow the Post-Lasso estimator to perform as well as
Lasso.

Rates of convergence in different norms can also be of interest in other applications. In
particular, the $\ell_2$-rate of convergence can be derived from the rate of convergence in the prediction
norm and Condition SE using a sparsity result for Lasso established in Appendix D. Below we
specialize the previous theorems to the important case that Condition SE holds.

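Corollary 1 (Rates for Lasso and Post-Lasso under SE). Under the conditions of Theorem 2
and Condition SE, the Lasso and Post-Lasso estimators satisfy

$$\max_{1 \le l \le k_e}\|\hat D_{il} - D_{il}\|_{2,n} \lesssim_P \sqrt{\frac{s\log(p \vee n)}{n}},$$

$$\max_{1 \le l \le k_e}\|\hat\beta_l - \beta_{l0}\|_2 \lesssim_P \sqrt{\frac{s\log(p \vee n)}{n}}, \qquad \max_{1 \le l \le k_e}\|\hat\beta_l - \beta_{l0}\|_1 \lesssim_P \sqrt{\frac{s^2\log(p \vee n)}{n}}.$$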



The rates of convergence in the prediction norm and $\ell_2$-norm are faster than the rate of
convergence in the $\ell_1$-norm, which is typical of high-dimensional settings.
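
The gap between the two coefficient rates can be read off from a standard inequality. As a sketch, assume the estimation error $\delta = \hat\beta_l - \beta_{l0}$ has at most $m$ non-zero components with $m \lesssim s$, which is the kind of sparsity bound the text attributes to Appendix D; the symbols $\delta$ and $m$ are shorthand introduced only for this remark. The Cauchy-Schwarz inequality then gives

$$\|\delta\|_1 = \sum_{j : \delta_j \ne 0} |\delta_j| \le \sqrt{m}\,\|\delta\|_2 \lesssim \sqrt{s}\,\sqrt{\frac{s\log(p \vee n)}{n}} = \sqrt{\frac{s^2\log(p \vee n)}{n}},$$

so the $\ell_1$-rate is the $\ell_2$-rate inflated by a factor of order $\sqrt{s}$.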

4. Main Results on IV Estimation 
In this section we present our main inferential results on instrumental variable estimators. 

4.1. The IV estimator with Lasso-based instruments. We impose the following moment 
conditions on the instruments, the structural errors, and regressors. 

Condition SM. (i) The eigenvalues of $Q = \bar E[D(x_i)D(x_i)']$ are bounded uniformly from
above and away from zero, uniformly in $n$. The conditional variance $E[\epsilon_i^2 | x_i]$ is bounded
uniformly from above and away from zero, uniformly in $i$ and $n$. Given this assumption, without
loss of generality, we normalize the instruments so that $\bar E[f_{ij}^2 \epsilon_i^2] = 1$ for each $1 \le j \le p$ and for
all $n$. (ii) For some $q > 2$ and $q_\epsilon > 2$, uniformly in $n$,

$$\max_{1 \le j \le p} \bar E[|f_{ij}\epsilon_i|^3] + \bar E[\|D_i\|_2^q |\epsilon_i|^{2q}] + \bar E[\|D_i\|_2^q] + \bar E[|\epsilon_i|^{q_\epsilon}] + \bar E[\|d_i\|_2^q] \lesssim 1.$$

(iii) In addition to $\log^3 p = o(n)$, the following growth conditions hold:

$$\text{(a) } \frac{s\log(p \vee n)}{n}\, n^{2/q_\epsilon} \to 0, \qquad \text{(b) } \frac{s^2\log^2(p \vee n)}{n} \to 0, \qquad \text{(c) } \max_{1 \le j \le p} E_n[f_{ij}^2 \epsilon_i^2] \lesssim_P 1.$$



Under further conditions stated in proofs, Post-Lasso can sometimes achieve a faster rate of convergence. In
special cases where perfect model selection is possible, Post-Lasso becomes the so-called oracle estimator and can
completely remove the $\log p$ factor.

Comment 4.1. (On Condition SM) Condition SM(i) places restrictions on the variation of
the structural errors ($\epsilon_i$) and the optimal instruments ($D(x_i)$). The first condition about the
variation in the optimal instrument guarantees that identification is strong; that is, it ensures
that the conditional expectation of the endogenous variables given the instruments is a
non-trivial function of the instruments. This assumption rules out non-identification, in which case
$D(x)$ does not depend on $x$, and weak identification, in which case $D(x)$ would be local to a
constant function. We present an inference procedure that remains valid without this condition
in Section 4.2. The remaining restriction in Condition SM(i) requires that structural errors are
boundedly heteroscedastic. Given this we make a normalization assumption on the instruments.
This entails no loss of generality since this is equivalent to suitably rescaling the parameter
space for coefficients $\beta_{l0}$, $l = 1, \ldots, k_e$, via an isomorphic transformation. We use this
normalization to simplify notation in the proofs but do not use it in the construction of the estimators.
Condition SM(ii) imposes some mild moment assumptions. Condition SM(iii) strengthens the
growth requirement $s\log p/n \to 0$ needed for estimating conditional expectations. However, the
restrictiveness of Condition SM(iii)(a) rapidly decreases as the number of bounded moments of
the structural error increases. Condition SM(iii)(b) indirectly requires the optimal instruments
in Condition AS to be smooth enough that the number of unknown series terms $s$ needed to
approximate them well is not too large. This condition ensures that the impact of the instrument
estimation on the IV estimator is asymptotically negligible. This condition can be relaxed using
the sample-splitting method.



The following lemma shows that the moment assumptions in Condition SM(iii) are plausible
for both many-instrument and many series-instrument settings.

Lemma 4 (Plausibility of SM(iii)). Suppose that the structural disturbance $\epsilon_i$ has uniformly
bounded conditional moments of order 4 uniformly in $n$ and that $s^2\log^2(p \vee n) = o(n)$. Then
Condition SM(iii) holds if (1) the regressors $f_i$ are Gaussian as in Lemma 1 or (2) the regressors
$f_i$ are arbitrary i.i.d. vectors with bounded entries as in Lemma 2.

The first result describes the properties of the IV estimator with the optimal IV constructed 
using Lasso or Post-Lasso in the setting of the standard model. The result also provides a 
consistent estimator for the asymptotic variance of this estimator under heteroscedasticity. 

Theorem 3 (Inference with Optimal IV Estimated by Lasso or Post-Lasso). Suppose that
data $(y_i, x_i, d_i)$ are i.n.i.d. and obey the linear IV model described in Section 2. Suppose also
that Conditions AS, RF, SM, (2.7) and (3.13) hold. To construct the estimate of the optimal
instrument, suppose that Condition RE holds in the case of using Lasso or that Condition SE and
(3.14) hold in the case of using Post-Lasso. Then the IV estimator $\hat\alpha$, based on either Lasso or
Post-Lasso estimates of the optimal instrument, is root-n consistent and asymptotically normal:

$$(Q^{-1}\Omega Q^{-1})^{-1/2}\sqrt{n}(\hat\alpha - \alpha_0) \to_d N(0, I),$$

for $\Omega := \bar E[\epsilon_i^2 D(x_i)D(x_i)']$ and $Q := \bar E[D(x_i)D(x_i)']$. Moreover, the result above continues to
hold with $\Omega$ replaced by $\hat\Omega := E_n[\hat\epsilon_i^2 \hat D(x_i)\hat D(x_i)']$ for $\hat\epsilon_i = y_i - d_i'\hat\alpha$, and $Q$ replaced by
$\hat Q := E_n[\hat D(x_i)\hat D(x_i)']$. In the case that the structural error $\epsilon_i$ is homoscedastic conditional on $x_i$,
that is, $E[\epsilon_i^2 | x_i] = \sigma^2$ a.s. for all $i = 1, \ldots, n$, the IV estimator $\hat\alpha$ based on either Lasso or
Post-Lasso estimates of the optimal instrument is root-n consistent, asymptotically normal, and
achieves the efficiency bound: $(\Lambda^*)^{-1/2}\sqrt{n}(\hat\alpha - \alpha_0) \to_d N(0, I)$ where $\Lambda^* := \sigma^2 Q^{-1}$. The result
above continues to hold with $\Lambda^*$ replaced by $\hat\Lambda^* := \hat\sigma^2 \hat Q^{-1}$, where $\hat Q := E_n[\hat D(x_i)\hat D(x_i)']$ and
$\hat\sigma^2 := E_n[(y_i - d_i'\hat\alpha)^2]$.



In the setting with homoscedastic structural errors the estimator achieves the efficiency bound 
asymptotically. In the case of heteroscedastic structural errors, the estimator does not achieve 
the efficiency bound, but we can expect it to be close to achieving the bound if heteroscedasticity 
is mild. 

The final result of this section extends the previous result to any IV-estimator with a generic 
sparse estimator of the optimal instruments. 

Theorem 4 (Inference with IV Constructed by a Generic Sparsity-Based Procedure). Suppose
that Conditions AS and SM hold, and suppose that the fitted values of the optimal instrument,
$\hat D_{il} = f_i'\hat\beta_l$, are constructed using any estimator $\hat\beta_l$ such that

$$\max_{1 \le l \le k_e}\|\hat D_{il} - D_{il}\|_{2,n} \lesssim_P \sqrt{\frac{s\log(p \vee n)}{n}} \quad\text{and}\quad \max_{1 \le l \le k_e}\|\hat\beta_l - \beta_{l0}\|_1 \lesssim_P \sqrt{\frac{s^2\log(p \vee n)}{n}}. \qquad (4.15)$$

Then the conclusions reached in Theorem 3 continue to apply.



This result shows that the previous two theorems apply for any first-stage estimator that
attains the near-oracle performance given in (4.15). Examples of other sparse estimators covered by
this theorem are Dantzig and Gauss-Dantzig (Candes and Tao, 2007), $\sqrt{\text{Lasso}}$ and post-$\sqrt{\text{Lasso}}$
(Belloni, Chernozhukov, and Wang, 2011a and 2011b), thresholded Lasso and Post-thresholded
Lasso (Belloni and Chernozhukov, 2012), group Lasso and Post-group Lasso (Huang, Horowitz,
and Wei, 2010; and Lounici, Pontil, Tsybakov, and van de Geer, 2010), adaptive versions of the
above (Huang, Horowitz, and Wei, 2010), and boosting (Bühlmann, 2006). Verification of the
near-oracle performance (4.15) can be done on a case-by-case basis using the best conditions
in the literature. Our results extend to Lasso-type estimators under alternative forms of
regularity conditions that fall outside the framework of Conditions RE and RF; all
that is required is the near-oracle performance of the kind (4.15).



Post-$\ell_1$-penalized procedures have only been analyzed for the case of Lasso and $\sqrt{\text{Lasso}}$; see Belloni and
Chernozhukov (2012) and Belloni, Chernozhukov, and Wang (2011a). We expect that similar results carry over
to other procedures listed above.

4.2. Inference when instruments are weak. When instruments are weak individually, Lasso
may end up selecting no instruments or may produce unreliable estimates of the optimal
instruments. To cover this case, we propose a method for inference based on inverting pointwise tests
performed using a sup-score statistic defined below. The procedure is similar in spirit to
Anderson and Rubin (1949) and Staiger and Stock (1997) but uses a different statistic that is better
suited to cases with very many instruments. In order to describe the approach, we rewrite the
main structural equation as:

$$y_i = d_{ei}'\alpha_1 + w_i'\alpha_2 + \epsilon_i, \qquad E[\epsilon_i | x_i] = 0, \qquad (4.16)$$

where $y_i$ is the response variable, $d_{ei}$ is a vector of endogenous variables, $w_i$ is a $k_w$-vector
of control variables, $x_i = (z_i', w_i')'$ is a vector of elementary instrumental variables, and $\epsilon_i$ is
a disturbance such that $\epsilon_1, \ldots, \epsilon_n$ are i.n.i.d. conditional on $X = [x_1', \ldots, x_n']$. We partition
$d_i = (d_{ei}', w_i')'$. The parameter of interest is $\alpha_1 \in \mathcal{A}_1 \subset \mathbb{R}^{k_e}$. We use $f_i = P(x_i)$, a vector which
includes $w_i$, as technical instruments. In this subsection, we treat $X$ as fixed; i.e. we condition
on $X$.

We would like to use a high-dimensional vector $f_i$ of technical instruments for inference on
$\alpha_1$ that is robust to weak identification. In order to formulate a practical sup-score
statistic, it is useful to partial out the effect of $w_i$ on the key variables. For an $n$-vector
$\{u_i, i = 1, \ldots, n\}$, define
$\tilde u_i = u_i - w_i' \mathbb{E}_n[w_i w_i']^{-1} \mathbb{E}_n[w_i u_i]$, i.e. the residuals left
after regressing this vector on $\{w_i, i = 1, \ldots, n\}$. Hence $\tilde y_i$, $\tilde d_{ei}$, and
$\tilde f_{ij}$ are residuals obtained by partialling out controls $w_i$. Also, let

\[ \tilde f_i = (\tilde f_{i1}, \ldots, \tilde f_{ip})'. \tag{4.17} \]

In this formulation, we omit elements of $w_i$ from $f_{ij}$ since they are eliminated by
partialling out. We then normalize these technical instruments so that

\[ \mathbb{E}_n[\tilde f_{ij}^2] = 1, \qquad j = 1, \ldots, p. \tag{4.18} \]

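The sup-score statistic for testing the hypothesis $\alpha_1 = a$ takes the form

\[ \Lambda_a = \max_{1 \le j \le p} \frac{\left| n \mathbb{E}_n[(\tilde y_i - \tilde d_{ei}' a)\tilde f_{ij}] \right|}{\sqrt{\mathbb{E}_n[(\tilde y_i - \tilde d_{ei}' a)^2 \tilde f_{ij}^2]}}. \tag{4.19} \]

As before, we apply moderate deviation theory for self-normalized sums to obtain

\[ P\left(\Lambda_{\alpha_1} \le c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2p))\right) \ge 1 - \gamma + o(1). \]

Therefore, we can employ $\Lambda(1-\gamma) := c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2p))$ for $c > 1$ as a
critical value for testing $\alpha_1 = a$ using $\Lambda_a$ as the test statistic. The asymptotic
$(1-\gamma)$-confidence region for $\alpha_1$ is then given by
$\mathcal{C} := \{a \in \mathcal{A}_1 : \Lambda_a \le \Lambda(1-\gamma)\}$.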
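
The construction can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the function names (`partial_out`, `sup_score_stat`, `critical_value`, `confidence_region`) and the default values of `gamma` and `c` are assumptions made for this example. It assumes the columns of `f` exclude the controls `w` (otherwise the partialled-out columns would be identically zero) and that `d` is stored as an `(n, k_e)` array.

```python
import numpy as np
from statistics import NormalDist


def partial_out(u, w):
    """Residuals from the least-squares regression of u on the controls w."""
    coef, *_ = np.linalg.lstsq(w, u, rcond=None)
    return u - w @ coef


def sup_score_stat(y_t, d_t, f_t, a):
    """Sup-score statistic (4.19) at a hypothesized value a (inputs partialled out)."""
    n = y_t.shape[0]
    r = y_t - d_t @ np.atleast_1d(a)              # candidate structural residual
    scores = f_t * r[:, None]                     # r_i * f_ij for every instrument j
    num = np.abs(n * scores.mean(axis=0))
    den = np.sqrt((scores ** 2).mean(axis=0))     # self-normalization
    return np.max(num / den)


def critical_value(n, p, gamma=0.05, c=1.1):
    """Lambda(1 - gamma) = c * sqrt(n) * Phi^{-1}(1 - gamma / (2p))."""
    return c * np.sqrt(n) * NormalDist().inv_cdf(1 - gamma / (2 * p))


def confidence_region(y, d, f, w, grid, gamma=0.05, c=1.1):
    """Collect the grid points a whose sup-score statistic is below the critical value."""
    y_t, d_t, f_t = partial_out(y, w), partial_out(d, w), partial_out(f, w)
    f_t = f_t / np.sqrt((f_t ** 2).mean(axis=0))  # normalization (4.18)
    cv = critical_value(len(y), f_t.shape[1], gamma, c)
    return [a for a in grid if sup_score_stat(y_t, d_t, f_t, a) <= cv]
```

For a scalar endogenous variable, `grid` can be a one-dimensional array of candidate values; the returned list is the grid approximation of the confidence region.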
The construction of confidence regions above can be given the following Inverse Lasso
interpretation. Let

\[ \hat\beta_a \in \arg\min_{\beta \in \mathbb{R}^p} \mathbb{E}_n\left[\left((\tilde y_i - \tilde d_{ei}' a) - \tilde f_i'\beta\right)^2\right] + \frac{\lambda}{n}\sum_{j=1}^{p} \gamma_{aj}|\beta_j|, \qquad \gamma_{aj} = \sqrt{\mathbb{E}_n[(\tilde y_i - \tilde d_{ei}' a)^2 \tilde f_{ij}^2]}. \]

If $\lambda = 2\Lambda(1-\gamma)$, then $\mathcal{C}$ is equivalent to the region
$\{a \in \mathbb{R}^{k_e} : \hat\beta_a = 0\}$. In words, this confidence region collects all
potential values of the structural parameter where the Lasso regression of the potential structural
disturbance on the instruments yields zero coefficients on the instruments. This idea is akin to the
Inverse Quantile Regression and Inverse Least Squares ideas in
Chernozhukov and Hansen (2008a, 2008b).
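
The equivalence follows from the first-order (subgradient) condition of the Lasso problem at $\beta = 0$. The short derivation below is a sketch added for the reader; it writes $r_i(a) := \tilde y_i - \tilde d_{ei}' a$, which is shorthand introduced here and not notation from the paper.

```latex
% beta = 0 minimizes the penalized criterion iff 0 lies in its subdifferential at 0:
\[
  \hat\beta_a = 0
  \iff
  \left| 2\,\mathbb{E}_n[r_i(a)\tilde f_{ij}] \right| \le \frac{\lambda}{n}\,\gamma_{aj}
  \quad \text{for all } j = 1,\ldots,p .
\]
% Dividing by gamma_{aj} and multiplying by n/2:
\[
  \iff
  \max_{1 \le j \le p}
  \frac{\left| n\,\mathbb{E}_n[r_i(a)\tilde f_{ij}] \right|}
       {\sqrt{\mathbb{E}_n[r_i(a)^2 \tilde f_{ij}^2]}}
  \le \frac{\lambda}{2}
  \iff
  \Lambda_a \le \Lambda(1-\gamma)
  \quad \text{when } \lambda = 2\Lambda(1-\gamma).
\]
```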

Below, we state the main regularity condition for the validity of inference using the sup-score 
statistic as well as the formal inference result. 



Condition SM2. Suppose that for each $n$ the linear model (4.16) holds with
$\alpha_1 \in \mathcal{A}_1 \subset \mathbb{R}^{k_e}$ such that $\epsilon_1, \ldots, \epsilon_n$ are
i.n.i.d., $X$ is fixed, and $\tilde f_1, \ldots, \tilde f_n$ are $p$-vectors of technical
instruments defined in (4.17) and (4.18). Suppose that (i) the dimension of $w_i$ is $k_w$ and
$\|w_i\|_2 \le \zeta_w$ such that $\sqrt{k_w}\,\zeta_w/\sqrt{n} \to 0$, (ii) the eigenvalues of
$\mathbb{E}_n[w_i w_i']$ are bounded away from zero and the eigenvalues of
$\mathrm{E}[\epsilon_i^2 w_i w_i']$ are bounded from above, uniformly in $n$,
(iii) $\max_{1 \le j \le p} \mathrm{E}[|\epsilon_i|^3 |\tilde f_{ij}|^3]^{1/3} / \mathrm{E}[\epsilon_i^2 \tilde f_{ij}^2]^{1/2} \le K_n$,
and (iv) $K_n^2 \log(p \vee n) = o(n^{1/3})$.

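Theorem 5 (Valid Inference based on the Sup-Score Statistic). Let $\gamma \in (0, 1)$ be fixed or,
more generally, such that $\log(1/\gamma) \lesssim \log(p \vee n)$. Under Condition SM2, (1) in large
samples, the constructed confidence set $\mathcal{C}$ contains the true value $\alpha_1$ with at
least the prescribed probability, namely $P(\alpha_1 \in \mathcal{C}) \ge 1 - \gamma - o(1)$.
(2) Moreover, the confidence set $\mathcal{C}$ necessarily excludes a sequence of parameter values
$a$, namely $P(a \in \mathcal{C}) \to 0$, if

\[ \max_{j \le p} \frac{\sqrt{n/\log(p/\gamma)}\,\left|\mathbb{E}_n[(a - \alpha_1)'\tilde d_{ei}\tilde f_{ij}]\right|}{c\sqrt{\mathbb{E}_n[((a - \alpha_1)'\tilde d_{ei})^2 \tilde f_{ij}^2]}} \to_P \infty. \]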



The theorem shows that the confidence region $\mathcal{C}$ constructed above is valid in large
samples and that the probability of including a false point $a$ in $\mathcal{C}$ tends to zero as
long as $a$ is sufficiently distant from $\alpha_1$ and instruments are not too weak. In particular,
if there is a strong instrument, the confidence regions will eventually exclude points $a$ that are
further than $\sqrt{\log(p \vee n)/n}$ away from $\alpha_1$. Moreover, if there are instruments
whose correlation with the endogenous variable is of greater order than $\sqrt{\log(p \vee n)/n}$,
then the confidence regions will asymptotically be bounded.



5. Further Inference and Estimation Results for the IV Model 



In this section we provide further estimation and inference results. We develop an
overidentification test which compares the IV-Lasso based estimates to estimates obtained using a
baseline set of instruments. We also combine the IV selection using Lasso with a sample-splitting
technique from the many instruments literature which allows us to relax the growth requirement on
the number of relevant instruments.

5.1. A Specification Test for Validity of Instrumental Variables. Here we develop a
Hausman-style specification test for the validity of the instrumental variables. Let
$A_i = A(x_i)$ be a baseline set of instruments, with $\dim(A_i) \ge \dim(\alpha) = k_\alpha$
bounded. Let $\tilde\alpha$ be the baseline instrumental variable estimator based on these
instruments:

\[ \tilde\alpha = \left(\mathbb{E}_n[d_i A_i']\,\mathbb{E}_n[A_i A_i']^{-1}\,\mathbb{E}_n[A_i d_i']\right)^{-1} \mathbb{E}_n[d_i A_i']\,\mathbb{E}_n[A_i A_i']^{-1}\,\mathbb{E}_n[A_i y_i]. \]

If the instrumental variable exclusion restriction is valid, then the unscaled difference between
this estimator and the IV estimator $\hat\alpha$ proposed in the previous sections should be small.
If the exclusion restriction is not valid, the difference between $\tilde\alpha$ and $\hat\alpha$
should be large. Therefore, we can reject the null hypothesis of instrument validity if the
difference is large.

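We formalize the test as follows. Suppose we care about $R\alpha$ for some $k \times k_d$ matrix
$R$ of $\mathrm{rank}(R) = k$. For instance, we might care only about the first $k$ components of
$\alpha$, in which case $R = [I_k \; 0]$ is a $k \times k_d$ matrix that selects the first $k$
coefficients of $\alpha$. Define the estimand for $\tilde\alpha$ as

\[ \alpha = \left(\mathrm{E}[d_i A_i']\,\mathrm{E}[A_i A_i']^{-1}\,\mathrm{E}[A_i d_i']\right)^{-1} \mathrm{E}[d_i A_i']\,\mathrm{E}[A_i A_i']^{-1}\,\mathrm{E}[A_i y_i], \]

and define the estimand of $\hat\alpha$ as

\[ \alpha_a = \mathrm{E}[D(x_i)D(x_i)']^{-1}\,\mathrm{E}[D(x_i)y_i]. \]

The null hypothesis $H_0$ is $R(\alpha - \alpha_a) = 0$ and the alternative $H_a$ is
$R(\alpha - \alpha_a) \ne 0$. We can form a test statistic

\[ J = \sqrt{n}(\tilde\alpha - \hat\alpha)' R' (R\hat\Sigma R')^{-1} R \sqrt{n}(\tilde\alpha - \hat\alpha) \]

for a matrix $\hat\Sigma$ defined below and reject $H_0$ if $J > c_\gamma$, where $c_\gamma$ is the
$(1-\gamma)$-quantile of a chi-square random variable with $k$ degrees of freedom. The justification
for this test is provided by the following theorem, which builds upon the previous results coupled
with conventional results for the baseline instrumental variable estimator.$^{11}$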

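Theorem 6 (Specification Test). (1) Suppose the conditions of Theorem^ hold, that
$\mathrm{E}[\|A_i\|_2^q]$ is bounded uniformly in $n$ for $q > 4$, and the eigenvalues of

\[ \Sigma := \mathrm{E}\left[\epsilon_i^2 (M A_i - Q^{-1}D(x_i))(M A_i - Q^{-1}D(x_i))'\right] \]

are bounded from above and below, uniformly in $n$, where

\[ M = \left(\mathrm{E}[d_i A_i']\,\mathrm{E}[A_i A_i']^{-1}\,\mathrm{E}[A_i d_i']\right)^{-1} \mathrm{E}[d_i A_i']\,\mathrm{E}[A_i A_i']^{-1}. \]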



11 The proof of this result is provided in a supplementary appendix. 
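Then $\sqrt{n}\,\hat\Sigma^{-1/2}(\tilde\alpha - \hat\alpha) \to_d N(0, I)$ and
$J \to_d \chi^2(k)$, where

\[ \hat\Sigma = \mathbb{E}_n\left[\hat\epsilon_i^2 (\hat M A_i - \hat Q^{-1}\hat D(x_i))(\hat M A_i - \hat Q^{-1}\hat D(x_i))'\right], \]

for $\hat\epsilon_i = y_i - d_i'\hat\alpha$, $\hat Q = \mathbb{E}_n[\hat D(x_i)\hat D(x_i)']$, and

\[ \hat M = \left(\mathbb{E}_n[d_i A_i']\,\mathbb{E}_n[A_i A_i']^{-1}\,\mathbb{E}_n[A_i d_i']\right)^{-1} \mathbb{E}_n[d_i A_i']\,\mathbb{E}_n[A_i A_i']^{-1}. \]

(2) Suppose the conditions of Theorem^ hold with the exception that $\mathrm{E}[A_i \epsilon_i] = 0$
for all $i = 1, \ldots, n$ and $n$, but $\|\mathrm{E}[D(x_i)\epsilon_i]\|_2$ is bounded away from
zero. Then $J \to_P \infty$.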

5.2. Split-sample IV estimator. The rate condition $s^2 \log^2(p \vee n) = o(n)$ can be
substantive and cannot be substantially weakened for the full-sample IV estimator considered above.
However, we can replace this condition with the weaker condition that

\[ s \log(p \vee n) = o(n) \]

by employing a sample splitting method from the many instruments literature (Angrist and
Krueger, 1995). Specifically, we consider dividing the sample randomly into (approximately)
equal parts $a$ and $b$, with sizes $n_a = \lceil n/2 \rceil$ and $n_b = n - n_a$. We use
superscripts $a$ and $b$ for variables in the first and second subsample respectively. The index
$i$ will enumerate observations in both samples, with ranges for the index given by
$1 \le i \le n_a$ for sample $a$ and $1 \le i \le n_b$ for sample $b$. We can use each of the
subsamples to fit the first stage via Lasso or Post-Lasso to obtain the first stage estimates
$\hat\beta_l^k$, $k = a, b$, and $l = 1, \ldots, k_e$. Then setting
$\hat D_{il}^a = f_i^{a\prime}\hat\beta_l^b$, $1 \le i \le n_a$,
$\hat D_{il}^b = f_i^{b\prime}\hat\beta_l^a$, $1 \le i \le n_b$,
$\hat D_i^k = (\hat D_{i1}^k, \ldots, \hat D_{ik_e}^k, w_i^{k\prime})'$, $k = a, b$, we form the IV
estimates in the two subsamples:

\[ \hat\alpha_a = \mathbb{E}_{n_a}[\hat D_i^a d_i^{a\prime}]^{-1}\,\mathbb{E}_{n_a}[\hat D_i^a y_i^a] \quad \text{and} \quad \hat\alpha_b = \mathbb{E}_{n_b}[\hat D_i^b d_i^{b\prime}]^{-1}\,\mathbb{E}_{n_b}[\hat D_i^b y_i^b]. \]

Then we combine the estimates into one

\[ \hat\alpha_{ab} = \left(n_a \mathbb{E}_{n_a}[\hat D_i^a \hat D_i^{a\prime}] + n_b \mathbb{E}_{n_b}[\hat D_i^b \hat D_i^{b\prime}]\right)^{-1} \left(n_a \mathbb{E}_{n_a}[\hat D_i^a \hat D_i^{a\prime}]\,\hat\alpha_a + n_b \mathbb{E}_{n_b}[\hat D_i^b \hat D_i^{b\prime}]\,\hat\alpha_b\right). \tag{5.20} \]

The following result shows that the split-sample IV estimator $\hat\alpha_{ab}$ has the same large
sample properties as the estimator $\hat\alpha$ of the previous section but requires a weaker
growth condition.

Theorem 7 (Inference with a Split-Sample IV Based on Lasso or Post-Lasso). Suppose that
data $(y_i, x_i, d_i)$ are i.n.i.d. and obey the linear IV model described in Section 2. Suppose
also that Conditions AS, RF, SM, (2.1), (3.13) and (3.14) hold, except that instead of growth
condition $s^2 \log^2(p \vee n) = o(n)$ we now have a weaker growth condition
$s \log(p \vee n) = o(n)$. Suppose also that Condition SE holds for
$M^k = \mathbb{E}_{n_k}[f_i^k f_i^{k\prime}]$ for $k = a, b$. Let
$\hat D_{il}^k = f_i^{k\prime}\hat\beta_l^{k^c}$ where $\hat\beta_l^{k^c}$ is the Lasso or
Post-Lasso estimator applied to the subsample
$\{(d_{il}^{k^c}, f_i^{k^c}) : 1 \le i \le n_{k^c}\}$ for $k = a, b$, and
$k^c = \{a, b\} \setminus k$. Then the split-sample IV estimator based on equation (5.20) is
$\sqrt{n}$-consistent and asymptotically normal, as $n \to \infty$

\[ (Q^{-1}\Omega Q^{-1})^{-1/2}\sqrt{n}(\hat\alpha_{ab} - \alpha_0) \to_d N(0, I), \]

for $\Omega := \bar{\mathrm{E}}[\epsilon_i^2 D(x_i)D(x_i)']$ and
$Q := \bar{\mathrm{E}}[D(x_i)D(x_i)']$. Moreover, the result above continues to hold with $\Omega$
replaced by $\hat\Omega := \mathbb{E}_n[\hat\epsilon_i^2 \hat D(x_i)\hat D(x_i)']$ for
$\hat\epsilon_i = y_i - d_i'\hat\alpha_{ab}$, and $Q$ replaced by
$\hat Q := \mathbb{E}_n[\hat D(x_i)\hat D(x_i)']$.
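
The combination step in (5.20) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the arrays `Da` and `Db` are assumed to already hold the cross-fitted instruments (rows of `Da` built from coefficients estimated on subsample b, and vice versa), with any controls appended as columns.

```python
import numpy as np


def iv_from_instrument(D_hat, d, y):
    """Subsample IV estimate: E_n[D d']^{-1} E_n[D y]."""
    n = len(y)
    return np.linalg.solve(D_hat.T @ d / n, D_hat.T @ y / n)


def split_sample_iv(Da, da, ya, Db, db, yb):
    """Combine the two subsample IV estimates with the matrix weights of (5.20)."""
    alpha_a = iv_from_instrument(Da, da, ya)
    alpha_b = iv_from_instrument(Db, db, yb)
    Ha = Da.T @ Da          # equals n_a * E_{n_a}[D D'] in subsample a
    Hb = Db.T @ Db          # equals n_b * E_{n_b}[D D'] in subsample b
    return np.linalg.solve(Ha + Hb, Ha @ alpha_a + Hb @ alpha_b)
```

The weights are matrix-valued, so a subsample whose estimated instruments have more variation receives more weight in the combined estimate.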



6. Simulation Experiment 

The previous sections' results suggest that using Lasso for fitting first-stage regressions should 
result in IV estimators with good estimation and inference properties. In this section, we provide 
simulation evidence regarding these properties in a situation where there are many possible 
instruments. We also compare the performance of the developed Lasso-based estimators to 
many-instrument robust estimators that are available in the literature. 

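Our simulations are based on a simple instrumental variables model data generating process
(DGP):

\[ y_i = \beta d_i + e_i, \qquad d_i = z_i'\Pi + v_i, \qquad (e_i, v_i) \sim N\left(0, \begin{pmatrix} \sigma_e^2 & \sigma_{ev} \\ \sigma_{ev} & \sigma_v^2 \end{pmatrix}\right) \text{ i.i.d.}, \]

where $\beta = 1$ is the parameter of interest, and
$z_i = (z_{i1}, z_{i2}, \ldots, z_{i100})' \sim N(0, \Sigma_Z)$ is a $100 \times 1$ vector with
$\mathrm{E}[z_{ih}^2] = \sigma_z^2$ and $\mathrm{Corr}(z_{ih}, z_{ij}) = .5^{|j-h|}$. In all
simulations, we set $\sigma_e^2 = 1$ and $\sigma_z^2 = 1$. We also set $\mathrm{Corr}(e, v) = 0.6$.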

For the other parameters, we consider various settings. We provide results for sample sizes,
$n$, of 100 and 250. We set $\sigma_v^2$ so that the unconditional variance of the endogenous
variable equals one; i.e. $\sigma_v^2 = 1 - \Pi'\Sigma_Z\Pi$. We use three different settings for
the pattern of the first-stage coefficients, $\Pi$. In the first, we set
$\Pi = C\tilde\Pi = C(1, .7, .7^2, \ldots, .7^{98}, .7^{99})'$; we term this the "exponential"
design. In the second and third case, we set $\Pi = C\tilde\Pi = C(\iota_s, 0_{100-s})'$ where
$\iota_s$ is a $1 \times s$ vector of ones and $0_{100-s}$ is a $1 \times (100 - s)$ vector of
zeros. We term this the "cut-off" design and consider two different values of $s$, $s = 5$ and
$s = 50$. In the exponential design, the model is not literally sparse although the majority of
explanatory power is contained in the first few instruments. While the model is exactly sparse in
the cut-off design, we expect Lasso to perform poorly with $s = 50$ since treating
$\frac{s^2 \log^2 p}{n}$ as vanishingly small seems like a poor approximation given the sample sizes
considered. We consider different values of the constant $C$ that are chosen to generate target
values for the concentration parameter, $\mu^2 = \frac{n\Pi'\Sigma_Z\Pi}{\sigma_v^2}$, which plays a
key role in the behavior of IV estimators; see, e.g. Stock, Wright, and Yogo (2002) or Hansen,
Hausman, and Newey (2008).$^{12}$ Specifically, we choose $C$ to solve
$\mu^2 = \frac{nC^2\tilde\Pi'\Sigma_Z\tilde\Pi}{1 - C^2\tilde\Pi'\Sigma_Z\tilde\Pi}$ for
$\mu^2 = 30$ and for $\mu^2 = 180$. These values of the concentration parameter were chosen by using
estimated values from the empirical example reported below as a benchmark.$^{13}$

12 The concentration parameter is closely related to the first-stage Wald statistic and first-stage
F-statistic for testing that the coefficients on the instruments are equal to 0. Under
homoscedasticity, the first-stage Wald statistic is
$W = \hat\Pi'(Z'Z)\hat\Pi/\hat\sigma_v^2$ and the first-stage F-statistic is $W/\dim(Z)$.
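
The calibration of $C$ has a closed form. Writing $q = \tilde\Pi'\Sigma_Z\tilde\Pi$, the equation $\mu^2 = nC^2q/(1 - C^2q)$ rearranges to $C^2 = \mu^2/((n + \mu^2)q)$, which also guarantees $\sigma_v^2 = 1 - C^2q > 0$. The sketch below draws one sample from the design under that calibration. It is an illustration only: the function names, argument names, and return values are choices made for this example, not the authors' simulation code.

```python
import numpy as np


def first_stage_pattern(design, p=100, s=5):
    """Unscaled first-stage coefficients (Pi-tilde) for the two designs."""
    if design == "exponential":
        return 0.7 ** np.arange(p)                 # (1, .7, .7^2, ..., .7^(p-1))
    if design == "cutoff":
        return np.r_[np.ones(s), np.zeros(p - s)]  # s ones followed by zeros
    raise ValueError(f"unknown design: {design}")


def simulate(n, mu2, design="exponential", s=5, p=100, beta=1.0, rho=0.6, rng=None):
    """Draw (y, d, z) from the DGP with concentration parameter mu2."""
    rng = np.random.default_rng(rng)
    idx = np.arange(p)
    Sigma_z = 0.5 ** np.abs(idx[:, None] - idx[None, :])  # Corr(z_h, z_j) = .5^|j-h|
    Pi_tilde = first_stage_pattern(design, p, s)
    q = Pi_tilde @ Sigma_z @ Pi_tilde
    C = np.sqrt(mu2 / ((n + mu2) * q))             # solves mu2 = n C^2 q / (1 - C^2 q)
    Pi = C * Pi_tilde
    sigma_v2 = 1.0 - Pi @ Sigma_z @ Pi             # Var(d) = 1
    cov_ev = rho * np.sqrt(sigma_v2)               # sigma_e = 1, Corr(e, v) = rho
    Sigma_ev = np.array([[1.0, cov_ev], [cov_ev, sigma_v2]])
    z = rng.multivariate_normal(np.zeros(p), Sigma_z, size=n)
    ev = rng.multivariate_normal(np.zeros(2), Sigma_ev, size=n)
    d = z @ Pi + ev[:, 1]
    y = beta * d + ev[:, 0]
    return y, d, z, Pi, sigma_v2
```

For example, `simulate(100, 30.0, design="cutoff", s=5, rng=0)` produces one replication of the $n = 100$, $\mu^2 = 30$, $s = 5$ cut-off design.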

For each setting of the simulation parameter values, we report results from seven different
procedures. A simple possibility when presented with many instrumental variables (with $p < n$) is
to just estimate the model using 2SLS and all of the available instruments. It is well-known that
this will result in poor finite-sample properties unless there are many more observations than
instruments; see, for example, Bekker (1994). The estimator proposed in Fuller (1977) (FULL)
is robust to many instruments (with $p < n$) as long as the presence of many instruments is
accounted for when constructing standard errors for the estimators; see Hansen, Hausman, and
Newey (2008) for example.$^{14}$ We report results for these estimators in rows labeled 2SLS(100)
and FULL(100) respectively.$^{15}$ For our variable selection procedures, we use Lasso to select
among the instruments using the refined data-dependent penalty loadings given in (A.21)
described in Appendix A and consider two post-model selection estimation procedures. The first,
Post-Lasso, runs 2SLS using the instruments selected by Lasso; and the second, Post-Lasso-F,
runs FULL using the instruments selected by Lasso. In cases where no instruments are selected
by Lasso, we use the point estimate obtained by running 2SLS with the single instrument with
the highest within-sample correlation to the endogenous variable as the point estimate for
Post-Lasso and Post-Lasso-F. In these cases, we use the sup-score test for performing
inference.$^{16}$ We report inference results based on the weak-identification robust sup-score
testing procedure in rows labeled "sup-Score".

The other two procedures, "Post-Lasso (Ridge)" and "Post-Lasso-F (Ridge)", use a combination
of Ridge regression, Lasso, and sample-splitting. For these procedures, we randomly split the
sample into two equal-sized parts. Call these sub-samples "sample A" and "sample B." We
then use leave-one-out cross-validation with only the data in sample A to select a ridge penalty
parameter and then estimate a set of ridge coefficients using this penalty and the data from
sample A. We then use the data from sample B with these coefficients estimated using only
data from sample A to form first-stage fitted values for sample B. Then, we take the full set of
instruments augmented with the estimated fitted value just described and perform Lasso variable



13 In the empirical example, first-stage Wald statistics based on the selected instruments range
from between 44 and 243. In the cases with constant coefficients, our concentration parameter
choices correspond naturally to "infeasible F-statistics" defined as $\mu^2/s$ of 6 and 36 with
$s = 5$ and .6 and 3.6 with $s = 50$. In an online appendix, we provide additional simulation
results. The results reported in the current section are sufficient to capture the key patterns.

14 FULL requires a user-specified parameter. We set this parameter equal to one which produces a higher- 
order unbiased estimator. See Hahn, Hausman, and Kuersteiner (2004) for additional discussion. LIML is 
another commonly proposed estimator which is robust to many instruments. In our designs, its performance was 
generally similar to that of FULL, and we report only FULL for brevity. 

15 With $n = 100$, estimates are based on a randomly selected 99 instruments. 

16 Inference based on the asymptotic approximation when Lasso selects instruments and based on the sup-Score 
test when Lasso fails to select instruments is our preferred procedure. 
selection using only the data from sample B. We use the selected variables to run either 2SLS
or Fuller in sample B to obtain estimates of $\beta$ (and associated standard errors), say
$\hat\beta_{B,2SLS}$ ($s_{B,2SLS}$) and $\hat\beta_{B,Fuller}$ ($s_{B,Fuller}$). We then repeat this
exercise switching sample A and B to obtain estimates of $\beta$ (and associated standard errors)
from sample A, say $\hat\beta_{A,2SLS}$ ($s_{A,2SLS}$) and $\hat\beta_{A,Fuller}$ ($s_{A,Fuller}$).
Post-Lasso (Ridge) is then $w_{A,2SLS}\hat\beta_{A,2SLS} + (1 - w_{A,2SLS})\hat\beta_{B,2SLS}$ for
$w_{A,2SLS} = \frac{s_{B,2SLS}^2}{s_{A,2SLS}^2 + s_{B,2SLS}^2}$, and Post-Lasso-F (Ridge) is defined
similarly. If instruments are selected in one subsample but not in the other, we put weight one on
the estimator from the subsample where instruments were selected. If no instruments are selected in
either subsample, we use the single instrument with the highest correlation to obtain the point
estimate and use the sup-score test for performing inference.

For each estimator, we report median bias (Med. Bias), median absolute deviation (MAD),
and rejection frequencies for 5% level tests (rp(.05)). For computing rejection frequencies, we
estimate conventional, homoscedastic 2SLS standard errors for 2SLS(100) and Post-Lasso and
the many instrument robust standard errors of Hansen, Hausman, and Newey (2008) which rely
on homoscedasticity for FULL(100) and Post-Lasso-F. We report the number of cases in which
Lasso selected no instruments in the column labeled N(0).
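
The three summary measures can be computed from the simulation replications as in the sketch below. Two points are assumptions of this illustration rather than statements from the text: the absolute deviation is measured around the true value of the parameter, and the 5% level test is a two-sided t-test using the normal critical value 1.96. The function name `summarize` is hypothetical.

```python
import numpy as np


def summarize(estimates, std_errors, beta0=1.0, crit=1.96):
    """Median bias, median absolute deviation, and rejection frequency of H0: beta = beta0."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    med_bias = np.median(est) - beta0
    mad = np.median(np.abs(est - beta0))           # deviation from the true value (assumed)
    rej = np.mean(np.abs(est - beta0) / se > crit) # share of replications rejecting the truth
    return med_bias, mad, rej
```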

We summarize the simulation results in Table 1. It is apparent that the Lasso procedures are 
dominant when n = 100. In this case, the Lasso-based procedures outperform 2SLS(100) and 
FULL(100) on all dimensions considered. When the concentration parameter is 30 or s = 50, 
the instruments are relatively weak, and Lasso accordingly selects no instruments in many cases. 
In these cases, inference switches to the robust sup-score procedure which controls size. With 
a concentration parameter of 180, the instruments are relatively more informative and sparsity 
provides a good approximation in the exponential design and s = 5 cut-off design. In these 
cases, Lasso selects instruments in the majority of replications and the procedure has good risk 
and inference properties relative to the other procedures considered. In the n = 100 case, the 
simple Lasso procedures also clearly dominate Lasso augmented with Ridge as this procedure 
often results in no instruments being selected and relatively low power; see Figure 1. We also 
see that the sup-score procedure controls size across the designs considered. 

In the n = 250 case, the conventional many-instrument asymptotic sequence which has p
proportional to n but p/n < 1 provides a reasonable approximation to the DGP, and one
would expect FULL to perform well. In this case, 2SLS(100) is clearly dominated by the other
procedures. However, there is no obvious ranking between FULL(100) and the Lasso-based
procedures. With s = 50, sparsity is a poor approximation in that there is signal in the
combination of the 50 relevant instruments but no small set of instruments has much explanatory
power. In this setting, FULL(100) has lower estimation risk than the Lasso procedure, which
is not effectively able to capture the diffuse signal, though both inference procedures have size
close to the prescribed level. Lasso augmented with the Ridge fit also does relatively well in
this setting, being roughly on par with FULL(100). In the exponential and cut-off with s = 5
designs, sparsity is a much better approximation. In these cases, the simple Lasso-based
estimators have smaller risk than FULL(100) or Lasso with Ridge and produce tests that have size
close to the nominal 5% level. Finally, we see that the sup-score procedure continues to control
size with n = 250.

Given that the sup-score procedure uniformly controls size across the designs considered but is
actually substantially undersized, it is worth presenting additional results regarding power. We
plot size-adjusted power curves for the sup-score test, Post-Lasso-F, Post-Lasso-F (Ridge), and
FULL(100) across the different designs in the $\mu^2 = 180$ cases in Figure 1. We focus on
$\mu^2 = 180$ since we expect it is when identification is relatively strong that differences in
power curves will be most pronounced. From these curves, it is apparent that the robustness of the
sup-score test comes with a substantial loss of power in cases where identification is strong.
Exploring other procedures that are robust to weak identification, allow for $p \gg n$, and do not
suffer from such power losses may be interesting for future research.

6.1. Conclusions from Simulation Experiments. The evidence from the simulations is
supportive of the derived theory and favorable to Lasso-based IV methods. The Lasso-IV estimators
clearly dominate on all metrics considered when $p = n$ and $s \ll n$. The Lasso-based IV
estimators generally have relatively small median bias and estimator risk and do well in terms of
testing properties, though they do not dominate FULL in these dimensions across all designs 
with p < n. The simulation results verify that FULL becomes more appealing as the sparsity 
assumption breaks down. This breakdown of sparsity is likely in situations with weak instru- 
ments, be they many or few, where none of the first-stage coefficients are well-separated from 
zero relative to sampling variation. Overall, the simulation results show that simple Lasso-based 
procedures can usefully complement other many-instrument methods. 

7. The Impact of Eminent Domain on Economic Outcomes 

As an example of the potential application of Lasso to select instruments, we consider IV
estimation of the effects of federal appellate court decisions regarding eminent domain on a
variety of economic outcomes.$^{17}$ To try to uncover the relationship between takings law and
economic outcomes, we estimate structural models of the form

\[ y_{ct} = \alpha_c + \alpha_t + \gamma_c t + \beta\,\text{Takings Law}_{ct} + W_{ct}'\delta + \epsilon_{ct} \]



17 See Chen and Yeh (2010) for a detailed discussion of the economics of takings law (or eminent
domain), relevant institutional features of the legal system, and a careful discussion of
endogeneity concerns and the instrumental variables strategy in this context.
where $y_{ct}$ is an economic outcome for circuit $c$ at time $t$, Takings Law$_{ct}$ represents the
number of pro-plaintiff appellate takings decisions in circuit $c$ and year $t$; $W_{ct}$ are
judicial pool characteristics,$^{18}$ a dummy for whether there were no cases in that circuit-year,
and the number of takings appellate decisions; and $\alpha_c$, $\alpha_t$, and $\gamma_c t$ are
respectively circuit-specific effects, time-specific effects, and circuit-specific time trends. An
appellate court decision is coded as pro-plaintiff if the court ruled that a taking was unlawful,
thus overturning the government's seizure of the property in favor of the private owner. We
construe pro-plaintiff decisions to indicate a regime that is more protective of individual
property rights. The parameter of interest, $\beta$, thus represents the effect of an additional
decision upholding individual property rights on an economic outcome.

We provide results using four different economic outcomes: the log of three home-price-indices 
and log(GDP). The three different home-price-indices we consider are the quarterly, weighted, 
repeat-sales FHFA / OFHEO house price index that tracks single-family house prices at the state 
level for metro (FHFA) and non-metro (Non-Metro) areas and the Case-Shiller home price index 
(Case-Shiller) by month for 20 metropolitan areas based on repeat-sales residential housing 
prices. We also use state level GDP from the Bureau of Economic Analysis to form log(GDP). 
For simplicity and since all of the controls, instruments, and the endogenous variable vary only 
at the circuit-year level, we use the within-circuit-year average of each of these variables as the 
dependent variables in our models. Due to the different coverage and time series lengths available 
for each of these series, the sample sizes and sets of available controls differ somewhat across 
the outcomes. These differences lead to different first-stages across the outcomes as well. The 
total sample sizes are 312 for FHFA and GDP which have identical first-stages. For Non-Metro 
and Case-Shiller, the sample sizes are 110 and 183 respectively. 

The analysis of the effects of takings law is complicated by the possible endogeneity between 
governmental takings and takings law decisions and economic variables. To address the potential 
endogeneity of takings law, we employ an instrumental variables strategy based on the identifi- 
cation argument of Chen and Sethi (2010) and Chen and Yeh (2010) that relies on the random 
assignment of judges to federal appellate panels. Since judges are randomly assigned to three 
judge panels to decide appellate cases, the exact identity of the judges and, more importantly, 
their demographics are randomly assigned conditional on the distribution of characteristics of 
federal circuit court judges in a given circuit-year. Thus, once the distribution of character- 
istics is controlled for, the realized characteristics of the randomly assigned three judge panel 
should be unrelated to other factors besides judicial decisions that may be related to economic 
outcomes. 

There are many potential characteristics of three judge panels that may be used as instru- 
ments. While the basic identification argument suggests any set of characteristics of the three 

18 The judicial pool characteristics are the probability of a panel being assigned with the
characteristics used to construct the instruments. There are 30, 33, 32, and 30 controls available
for FHFA house prices, non-metro house prices, Case-Shiller house prices, and GDP respectively.

judge panel will be uncorrelated with the structural unobservable, there will clearly be some
instruments which are more worthwhile than others in obtaining precise second-stage estimates.
For simplicity, we consider only the following demographics: gender, race, religion, political
affiliation, whether the judge's bachelor was obtained in-state, whether the bachelor is from a
public university, whether the JD was obtained from a public university, and whether the judge
was elevated from a district court along with various interactions. In total, we have 138, 143,
147, and 138 potential instruments for FHFA prices, non-metro prices, Case-Shiller, and GDP
respectively that we select among using Lasso.$^{19}$



Table 3 contains estimation results for $\beta$. We report OLS estimates and results based on three
different sets of instruments. The first set of instruments, used in the rows labeled 2SLS, are
the instruments adopted in Chen and Yeh (2010).$^{20}$ We consider this the baseline. The second
set of instruments are those selected through Lasso using the refined data-driven penalty.$^{21}$
The number of instruments selected by Lasso is reported in the row "S". We use the Post-Lasso 2SLS
estimator and report these results in the rows labeled "Post-Lasso". The third set of instruments
is simply the union of the first two instrument sets. Results for this set of instruments are in
the rows labeled "Post-Lasso+". In this case, "S" is the total number of instruments used.
In all cases, we use heteroscedasticity consistent standard error estimators. Finally, we report
the value of the test statistic discussed in Section 5.1 comparing estimates using the first and
second sets of instruments in the row labeled "Spec. Test".

The most interesting results from the standpoint of the present paper are found by comparing 
first-stage Wald-statistics and estimated standard errors across the instrument sets. The Lasso 
instruments are clearly much better first-stage predictors as measured by the first-stage Wald- 
statistic compared to the Chen and Yeh (2010) benchmark. Given the degrees of freedom, this 
increase obviously corresponds to Lasso-based IV providing a stronger first-stage relationship 
for FHFA prices, GDP, and the Case-Shiller prices. In the non-metro case, the p-value from 
the Wald test with the baseline instruments of Chen and Yeh (2010) is larger than that of the 
Lasso-selected instruments. This improved first-stage prediction is associated with the resulting 



19 Given the sample sizes and numbers of variables, estimators using all the instruments without shrinkage are 
only defined in the GDP and FHFA data. For these outcomes, the Fuller (1977) point estimate (standard error) 
is -.0020 (3.123) for FHFA and .0120 (.1758) for GDP. 

20 Chen and Yeh (2010) used two variables motivated on intuitive grounds, whether a panel was
assigned an appointee who did not report a religious affiliation and whether a panel was assigned
an appointee who earned their first law degree from a public university, as instruments.

21 Lasso selects the number of panels with at least one appointee whose law degree is from a public
university (Public) cubed for GDP and FHFA. In the Case-Shiller data, Lasso selects Public and
Public squared. For non-metro prices, Lasso selects Public interacted with the number of panels
with at least one member who reports belonging to a mainline protestant religion, Public interacted
with the number of panels with at least one appointee whose BA was obtained in-state (In-State),
In-State interacted with the number of panels with at least one non-white appointee, and the
interaction of the number of panels with at least one Democrat appointee with the number of panels
with at least one Jewish appointee.
2SLS estimator having smaller estimated standard errors than the benchmark case for non-metro
prices, Case-Shiller prices, and GDP. The reduction in standard errors is sizable for both
non-metro and Case-Shiller. The standard error estimate is somewhat larger in the FHFA case
despite the improvement in first-stage prediction. The fact that the Post-Lasso first stage
produces a larger first-stage Wald statistic while choosing fewer instruments than the benchmark
suggests that we might prefer the Post-Lasso results in any case. We also see that the statistics
for testing that the difference between the estimate using the Chen and Yeh (2010) instruments and
the Post-Lasso estimate is equal to zero are uniformly small. Given the small differences between
estimates using the first two sets of instruments, it is unsurprising that the results using the
union of the two instrument sets are similar to those already discussed.

The results are also economically interesting. The point estimates for the effect of an ad- 
ditional pro-plaintiff decision, a decision in favor of individual property holders, are positive, 
suggesting these decisions are associated with increases in property prices and GDP. These 
point estimates are all small, and it is hard to draw any conclusion about the likely effect on 
GDP or the FHFA index given their estimated standard errors. On the other hand, confidence 
intervals for non-metro and Case-Shiller constructed at usual confidence levels exclude zero. 
Overall, the results do suggest that the causal effect of decisions reinforcing individual property 
rights is an increase in the value of holding property, at least in the short term. The results are 
also consistent with the developed asymptotic theory in that the 2SLS point-estimates based on 
the benchmark instruments are similar to the estimates based on the Lasso-selected instruments 
while Lasso produces a stronger first-stage relationship and the Post-Lasso estimates are more 
precise in three of the four cases. The example suggests that there is the potential for Lasso to 
be fruitfully employed to choose instruments in economic applications. 



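Appendix A. Implementation Algorithms

It is useful to organize the precise implementation details into the following algorithm. We
establish the asymptotic validity of this algorithm in the subsequent sections. Feasible options
for setting the penalty level and the loadings for $j = 1, \ldots, p$, and $l = 1, \ldots, k_e$ are

\[ \begin{array}{lll} \text{initial} & \hat\gamma_{lj} = \sqrt{\mathbb{E}_n[f_{ij}^2 (d_{il} - \bar d_l)^2]}, & \lambda = 2c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2k_e p)), \\ \text{refined} & \hat\gamma_{lj} = \sqrt{\mathbb{E}_n[f_{ij}^2 \hat v_{il}^2]}, & \lambda = 2c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2k_e p)), \end{array} \tag{A.21} \]

where $c > 1$ is a constant, $\gamma \in (0, 1)$, $\bar d_l := \mathbb{E}_n[d_{il}]$ and
$\hat v_{il}$ is an estimate of $v_{il}$. Let $K \ge 1$ denote a bounded number of iterations. We
used $c = 1.1$, $\gamma = 0.1/\log(p \vee n)$, and $K = 15$ in the simulations. In what follows
Lasso/Post-Lasso estimator indicates that the practitioner can apply either the Lasso or Post-Lasso
estimator. Our preferred approach uses Post-Lasso at every stage.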
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Algorithm A.l (Lasso/Post-Lasso Estimators). (1) For each I = l,...,k e , specify penalty 
loadings according to the initial option in (A. 21). Use these penalty loadings in computing 
the Lasso/Post-Lasso estimator fii via equations (2.4) or (2.8). Then compute residuals vu = 
du — f'iPi, i = 1, n. (2) For each I = 1, . . . , k e , update the penalty loadings according to the 
refined option in (A. 21) and update the Lasso/Post-Lasso estimator Pi. Then compute a new 
set of residuals using the updated Lasso/Post-Lasso coefficients vu = du — f[Pi, i = 1, ...,n. (3) 
Repeat the previous step K times. 
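The iteration in Algorithm A.1 can be sketched in a few lines of code. The following is a minimal illustration only, not the implementation used for the paper's simulations: it treats a single endogenous variable ($k_e = 1$), uses the Lasso (not Post-Lasso) at every stage, assumes the columns of the instrument matrix and the endogenous variable have been demeaned, and solves the weighted Lasso problem by plain coordinate descent. All function names (`penalty_level`, `loadings`, `weighted_lasso`, `algorithm_a1`) are hypothetical.

```python
import math
from statistics import NormalDist


def penalty_level(n, p, k_e=1, c=1.1, gamma=None):
    """lambda = 2 c sqrt(n) Phi^{-1}(1 - gamma / (2 k_e p))."""
    if gamma is None:
        gamma = 0.1 / math.log(max(p, n))  # value reported for the simulations
    return 2.0 * c * math.sqrt(n) * NormalDist().inv_cdf(1.0 - gamma / (2.0 * k_e * p))


def loadings(F, r):
    """gamma_j = sqrt(E_n[f_ij^2 r_i^2]) for a residual-like vector r."""
    n, p = len(F), len(F[0])
    return [math.sqrt(sum((F[i][j] * r[i]) ** 2 for i in range(n)) / n) for j in range(p)]


def weighted_lasso(F, d, lam, gam, sweeps=500, tol=1e-10):
    """Coordinate descent for E_n[(d - f'b)^2] + (lam / n) * sum_j gam_j |b_j|."""
    n, p = len(F), len(F[0])
    b, resid = [0.0] * p, list(d)
    z = [sum(F[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        biggest_move = 0.0
        for j in range(p):
            if z[j] == 0.0:
                continue
            # Partial correlation of column j with the residual that excludes b_j.
            rho = sum(F[i][j] * resid[i] for i in range(n)) / n + z[j] * b[j]
            thr = lam * gam[j] / (2.0 * n)  # soft-threshold level for coordinate j
            new = math.copysign(max(abs(rho) - thr, 0.0), rho) / z[j]
            move = new - b[j]
            if move != 0.0:
                for i in range(n):
                    resid[i] -= F[i][j] * move
                b[j] = new
                biggest_move = max(biggest_move, abs(move))
        if biggest_move < tol:
            break
    return b, resid


def algorithm_a1(F, d, K=15, c=1.1):
    """Initial loadings, then K refinements of loadings and Lasso coefficients."""
    n, p = len(F), len(F[0])
    lam = penalty_level(n, p, c=c)
    d_bar = sum(d) / n
    gam = loadings(F, [di - d_bar for di in d])  # initial option
    b, v_hat = weighted_lasso(F, d, lam, gam)
    for _ in range(K):                           # refined option
        gam = loadings(F, v_hat)
        b, v_hat = weighted_lasso(F, d, lam, gam)
    return b, v_hat
```

A Post-Lasso variant would refit by least squares on the selected support after each Lasso step and use those residuals to update the loadings.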



If Algorithm A.1 selected no instruments other than intercepts, or, more generally, if $\mathbb{E}_n[\hat D_{il}\hat D_{il}']$ is near-singular, proceed to Algorithm A.3; otherwise, we recommend the following algorithm.



Algorithm A.2 (IV Inference Using Estimates of Optimal Instrument). (1) Compute the estimates of the optimal instrument, $\hat D_{il} = f_i'\hat\beta_l$, for $i = 1,\dots,n$ and each $l = 1,\dots,k_e$, where $\hat\beta_l$ is computed by Algorithm A.1. Compute the IV estimator $\hat\alpha = \mathbb{E}_n[\hat D_i d_i']^{-1}\mathbb{E}_n[\hat D_i y_i]$. (2) Compute estimates of the asymptotic variance matrix $\hat Q^{-1}\hat\Omega\hat Q^{-1}$, where $\hat\Omega := \mathbb{E}_n[\hat\epsilon_i^2 \hat D_i \hat D_i']$ for $\hat\epsilon_i = y_i - d_i'\hat\alpha$, and $\hat Q := \mathbb{E}_n[\hat D_i \hat D_i']$. (3) Proceed to perform conventional inference using the normality result (2.10).
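For a single endogenous regressor the formulas in Algorithm A.2 reduce to scalar arithmetic. The sketch below illustrates that special case under the assumption that there are no additional controls and that the fitted instrument values `D_hat` have already been produced by a first stage; the function name is hypothetical.

```python
import math


def iv_with_estimated_instrument(y, d, D_hat):
    """Scalar-case sketch of Algorithm A.2 (k_e = 1, no controls).

    Returns the IV estimate and a heteroscedasticity-robust standard error
    based on the sandwich Q^{-1} Omega Q^{-1}.
    """
    n = len(y)

    def mean(values):
        # Empirical average E_n[.]
        return sum(values) / n

    # alpha_hat = E_n[D_hat d]^{-1} E_n[D_hat y]
    alpha = mean([h * yi for h, yi in zip(D_hat, y)]) / mean([h * di for h, di in zip(D_hat, d)])
    eps = [yi - di * alpha for yi, di in zip(y, d)]       # structural residuals
    Q = mean([h * h for h in D_hat])                       # Q_hat = E_n[D_hat^2]
    Omega = mean([e * e * h * h for e, h in zip(eps, D_hat)])
    avar = Omega / (Q * Q)                                 # asymptotic variance of sqrt(n)(alpha_hat - alpha_0)
    return alpha, math.sqrt(avar / n)
```

A conventional 95% interval is then `alpha ± 1.96 * se`, which is the scalar version of step (3).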



The following algorithm is only invoked if the weak instruments problem has been diagnosed, e.g., using the methods of Stock and Yogo (2005). In the algorithm below, $\mathcal{A}_1$ is the parameter space, and $\mathcal{G}_1 \subset \mathcal{A}_1$ is a grid of potential values for $\alpha_1$. Choose the confidence level $1 - \gamma$ of the interval, and set $\Lambda(1 - \gamma) = c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2p))$.

Algorithm A.3 (IV Inference Robust to Weak Identification). (1) Set $\mathcal{C} = \emptyset$. (2) For each $a \in \mathcal{G}_1$ compute $\Lambda_a$ as in (4.19). If $\Lambda_a \le \Lambda(1 - \gamma)$ add $a$ to $\mathcal{C}$. (3) Report $\mathcal{C}$.
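The grid search in Algorithm A.3 can be illustrated as follows. This is a hedged sketch: it assumes a scalar endogenous variable and no controls, so that no partialling-out step is needed before forming the statistic, and it takes the statistic to be the maximal self-normalized score $\max_j n|\mathbb{E}_n[e_i f_{ij}]|/\sqrt{\mathbb{E}_n[e_i^2 f_{ij}^2]}$ with $e_i = y_i - d_i a$. The function name and argument layout are hypothetical.

```python
import math
from statistics import NormalDist


def sup_score_region(y, d, F, grid, gamma=0.05, c=1.1):
    """Collect grid points a whose sup-score statistic is below Lambda(1 - gamma)."""
    n, p = len(F), len(F[0])
    crit = c * math.sqrt(n) * NormalDist().inv_cdf(1.0 - gamma / (2.0 * p))
    region = []
    for a in grid:
        e = [yi - di * a for yi, di in zip(y, d)]  # residual under the hypothesis alpha = a
        stat = 0.0
        for j in range(p):
            num = abs(sum(e[i] * F[i][j] for i in range(n)))              # n |E_n[e f_j]|
            den = math.sqrt(sum((e[i] * F[i][j]) ** 2 for i in range(n)) / n)
            if den > 0.0:
                stat = max(stat, num / den)
        if stat <= crit:
            region.append(a)
    return region
```

The returned list plays the role of the set $\mathcal{C}$; a finer grid gives a finer approximation to the confidence region.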



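Appendix B. Tools

The following useful lemma is a consequence of moderate deviations theorems for self-normalized sums in Jing, Shao, and Wang (2003) and de la Peña, Lai, and Shao (2009).

We shall be using the following result, Theorem 7.4 in de la Peña, Lai, and Shao (2009). Let $X_1,\dots,X_n$ be independent, zero-mean variables, and $S_n = \sum_{i=1}^n X_i$, $V_n^2 = \sum_{i=1}^n X_i^2$. For $0 < \mu \le 1$ set $B_n^2 = \sum_{i=1}^n \mathrm{E}X_i^2$, $L_{n,\mu} = \sum_{i=1}^n \mathrm{E}|X_i|^{2+\mu}$, $d_{n,\mu} = B_n/L_{n,\mu}^{1/(2+\mu)}$. Then uniformly in $0 \le x \le d_{n,\mu}$,

$$\frac{P(S_n/V_n \ge x)}{\bar\Phi(x)} = 1 + O(1)\left(\frac{1+x}{d_{n,\mu}}\right)^{2+\mu}, \qquad \frac{P(S_n/V_n \le -x)}{\Phi(-x)} = 1 + O(1)\left(\frac{1+x}{d_{n,\mu}}\right)^{2+\mu},$$

where the terms $O(1)$ are bounded in absolute value by a universal constant $A$, $\bar\Phi := 1 - \Phi$, and $\Phi$ is the cumulative distribution function of a standard Gaussian random variable.

Lemma 5 (Moderate Deviation Inequality for Maximum of a Vector). Suppose that

$$S_j = \frac{\sum_{i=1}^n U_{ij}}{\sqrt{\sum_{i=1}^n U_{ij}^2}},$$

where $U_{ij}$ are independent variables across $i$ with mean zero. We have that

$$P\left(\max_{1\le j\le p}|S_j| > \Phi^{-1}(1 - \gamma/2p)\right) \le \gamma\left(1 + \frac{A}{\ell_n^3}\right),$$

where $A$ is an absolute constant, provided that for $\ell_n > 0$

$$0 \le \Phi^{-1}(1 - \gamma/(2p)) \le \frac{n^{1/6}}{\ell_n}\min_{1\le j\le p} M[U_j] - 1, \qquad M[U_j] := \frac{\left(\frac1n\sum_{i=1}^n \mathrm{E}U_{ij}^2\right)^{1/2}}{\left(\frac1n\sum_{i=1}^n \mathrm{E}|U_{ij}|^3\right)^{1/3}}.$$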

Proof of Lemma 5. Step 1. We first note the following simple consequence of the result of Theorem 7.4 in de la Peña, Lai, and Shao (2009). Let $X_{1,n},\dots,X_{n,n}$ be the triangular array of i.n.i.d. zero-mean random variables. Suppose that $n^{1/6}M_n/\ell_n \ge 1$, where

$$M_n := \frac{\left(\frac1n\sum_{i=1}^n \mathrm{E}X_{i,n}^2\right)^{1/2}}{\left(\frac1n\sum_{i=1}^n \mathrm{E}|X_{i,n}|^3\right)^{1/3}}.$$

Then uniformly on $0 \le x \le n^{1/6}M_n/\ell_n - 1$, the quantities $S_{n,n} = \sum_{i=1}^n X_{i,n}$ and $V_{n,n}^2 = \sum_{i=1}^n X_{i,n}^2$ obey

$$\left|\frac{P(|S_{n,n}/V_{n,n}| \ge x)}{2\bar\Phi(x)} - 1\right| \le \frac{A}{\ell_n^3}.$$

This corollary follows by the application of the quoted theorem to the case with $\mu = 1$. The calculated error bound follows from the triangular inequalities and conditions on $\ell_n$ and $M_n$.
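The arithmetic behind this corollary can be spelled out as follows. This is a sketch of the omitted step under the definitions of Theorem 7.4 with $\mu = 1$; the symbols $M_n$ and $\ell_n$ are those of Step 1.

```latex
% With \mu = 1, the range parameter of Theorem 7.4 is
d_{n,1} = \frac{B_n}{L_{n,1}^{1/3}}
        = \frac{\bigl(n \cdot \tfrac1n \sum_{i=1}^n \mathrm{E} X_{i,n}^2\bigr)^{1/2}}
               {\bigl(n \cdot \tfrac1n \sum_{i=1}^n \mathrm{E} |X_{i,n}|^3\bigr)^{1/3}}
        = n^{1/2 - 1/3} M_n = n^{1/6} M_n .

% For 0 \le x \le n^{1/6} M_n / \ell_n - 1 we have (1 + x)/d_{n,1} \le 1/\ell_n, hence
\left| \frac{P(S_{n,n}/V_{n,n} \ge x)}{\bar\Phi(x)} - 1 \right|
   \le A \left( \frac{1 + x}{d_{n,1}} \right)^{3} \le \frac{A}{\ell_n^{3}} ,

% and the same bound holds for the lower tail with \Phi(-x) = \bar\Phi(x).
% Adding the two tail probabilities and applying the triangle inequality gives
\left| \frac{P(|S_{n,n}/V_{n,n}| \ge x)}{2\bar\Phi(x)} - 1 \right| \le \frac{A}{\ell_n^{3}} .
```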

Step 2. It follows that

$$P\left(\max_{1\le j\le p}|S_j| > \Phi^{-1}(1 - \gamma/2p)\right) \le_{(1)} p\max_{1\le j\le p}P\left(|S_j| > \Phi^{-1}(1 - \gamma/2p)\right)$$
$$=_{(2)} p\,P\left(|S_{j_n}| > \Phi^{-1}(1 - \gamma/2p)\right) \le_{(3)} p\,2\bar\Phi\left(\Phi^{-1}(1 - \gamma/2p)\right)\left(1 + \frac{A}{\ell_n^3}\right) \le \gamma\left(1 + \frac{A}{\ell_n^3}\right),$$

on the set $0 \le \Phi^{-1}(1 - \gamma/(2p)) \le \frac{n^{1/6}}{\ell_n}M_{j_n} - 1$, where inequality (1) follows by the union bound, equality (2) holds because the maximum is taken over a finite set, so the maximum is attained at some $j_n \in \{1,\dots,p\}$, and the last inequality follows by the application of Step 1, by setting $X_{i,n} = U_{i,j_n}$. □

Appendix C. Proof of Theorem 1

The proof of Theorem 1 has four steps. The most important steps are Steps 1-3. One half of Step 1, for bounding the $\|\cdot\|_{2,n}$-rate, follows the strategy of Bickel, Ritov, and Tsybakov (2009), but accommodates data-driven penalty loadings. The other half of Step 1, for bounding the $\|\cdot\|_1$-rate, is new for the nonparametric case. Step 2 innovatively uses the moderate deviation theory for self-normalized sums, which allows us to obtain sharp results for non-Gaussian and heteroscedastic errors as well as handle data-driven penalty loadings. Step 3 relates the ideal penalty loadings and the feasible penalty loadings. Step 4 puts the results together to reach the conclusions.

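Step 1. For $C > 0$ and each $l = 1,\dots,k_e$, consider the following weighted restricted eigenvalue

$$\kappa_C^l = \min_{\delta\in\mathbb{R}^p:\ \|\hat\Upsilon_l^0\delta_{T_l^c}\|_1 \le C\|\hat\Upsilon_l^0\delta_{T_l}\|_1,\ \|\delta\|_2 \ne 0} \frac{\sqrt{s}\,\|f_i'\delta\|_{2,n}}{\|\hat\Upsilon_l^0\delta_{T_l}\|_1}.$$

This quantity controls the modulus of continuity between the prediction norm $\|f_i'\delta\|_{2,n}$ and the $\ell_1$-norm $\|\delta\|_1$ within a restricted region that depends on $l = 1,\dots,k_e$. Note that if

$$a = \min_{1\le l\le k_e}\min_{1\le j\le p}\hat\Upsilon^0_{lj} \le \max_{1\le l\le k_e}\|\hat\Upsilon^0_l\|_\infty = b,$$

for every $C > 0$, because $\{\delta\in\mathbb{R}^p : \|\hat\Upsilon_l^0\delta_{T_l^c}\|_1 \le C\|\hat\Upsilon_l^0\delta_{T_l}\|_1\} \subseteq \{\delta\in\mathbb{R}^p : a\|\delta_{T_l^c}\|_1 \le bC\|\delta_{T_l}\|_1\}$ and $\|\hat\Upsilon_l^0\delta_{T_l}\|_1 \le b\|\delta_{T_l}\|_1$, we have

$$\min_{1\le l\le k_e}\kappa_C^l \ge (1/b)\,\kappa_{bC/a}(\mathbb{E}_n[f_i f_i'])$$

where the latter is the restricted eigenvalue defined in (3.12). If $C = c_0 = (uc + 1)/(\ell c - 1)$ we have $\min_{1\le l\le k_e}\kappa^l_{c_0} \ge (1/b)\,\kappa_{\bar C}(\mathbb{E}_n[f_i f_i'])$. By Condition RF and by Step 3 of Appendix C below, we have $a$ bounded away from zero and $b$ bounded from above with probability approaching one as $n$ increases.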

The main result of this step is the following lemma: 



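Lemma 6. Under Condition AS, if $\lambda/n \ge c\|S_l\|_\infty$, and $\hat\Upsilon_l$ satisfies (3.13) with $u \ge 1 \ge \ell > 1/c$, then

$$\|f_i'(\hat\beta_l - \beta_{l0})\|_{2,n} \le \left(u + \frac1c\right)\frac{\lambda\sqrt{s}}{n\kappa^l_{c_0}} + 2c_s,$$

$$\|\hat\Upsilon_l^0(\hat\beta_l - \beta_{l0})\|_1 \le 3c_0\frac{\sqrt{s}}{\kappa^l_{2c_0}}\left(\left(u + \frac1c\right)\frac{\lambda\sqrt{s}}{n\kappa^l_{c_0}} + 2c_s\right) + 3c_0\frac{n}{\lambda}c_s^2,$$

where $c_0 = (uc + 1)/(\ell c - 1)$.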



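Proof of Lemma 6. Let $\delta_l := \hat\beta_l - \beta_{l0}$. By optimality of $\hat\beta_l$ we have

$$\hat Q_l(\hat\beta_l) - \hat Q_l(\beta_{l0}) \le \frac{\lambda}{n}\left(\|\hat\Upsilon_l\beta_{l0}\|_1 - \|\hat\Upsilon_l\hat\beta_l\|_1\right). \tag{C.22}$$

Expanding the quadratic function $\hat Q_l$, and using that $S_l = 2(\hat\Upsilon_l^0)^{-1}\mathbb{E}_n[v_{il}f_i]$, we have

$$\left|\hat Q_l(\hat\beta_l) - \hat Q_l(\beta_{l0}) - \|f_i'\delta_l\|_{2,n}^2\right| = \left|2\mathbb{E}_n[v_{il}f_i'\delta_l] + 2\mathbb{E}_n[a_{il}f_i'\delta_l]\right| \le \|S_l\|_\infty\|\hat\Upsilon_l^0\delta_l\|_1 + 2c_s\|f_i'\delta_l\|_{2,n}. \tag{C.23}$$

So combining (C.22) and (C.23) with $\lambda/n \ge c\|S_l\|_\infty$ and the conditions imposed on $\hat\Upsilon_l$ in the statement of the theorem,

$$\|f_i'\delta_l\|_{2,n}^2 \le \frac{\lambda}{n}\left(\|\hat\Upsilon_l\delta_{lT_l}\|_1 - \|\hat\Upsilon_l\delta_{lT_l^c}\|_1\right) + \|S_l\|_\infty\|\hat\Upsilon_l^0\delta_l\|_1 + 2c_s\|f_i'\delta_l\|_{2,n}$$
$$\le \left(u + \frac1c\right)\frac{\lambda}{n}\|\hat\Upsilon_l^0\delta_{lT_l}\|_1 - \left(\ell - \frac1c\right)\frac{\lambda}{n}\|\hat\Upsilon_l^0\delta_{lT_l^c}\|_1 + 2c_s\|f_i'\delta_l\|_{2,n}. \tag{C.24}$$

To show the first statement of the Lemma we can assume $\|f_i'\delta_l\|_{2,n} \ge 2c_s$, otherwise we are done. This condition together with relation (C.24) implies that for $c_0 = (uc+1)/(\ell c - 1)$ we have $\|\hat\Upsilon_l^0\delta_{lT_l^c}\|_1 \le c_0\|\hat\Upsilon_l^0\delta_{lT_l}\|_1$. Therefore, by definition of $\kappa^l_{c_0}$, we have $\|\hat\Upsilon_l^0\delta_{lT_l}\|_1 \le \sqrt{s}\,\|f_i'\delta_l\|_{2,n}/\kappa^l_{c_0}$. Thus, relation (C.24) implies $\|f_i'\delta_l\|_{2,n}^2 \le \left(u + \frac1c\right)\frac{\lambda\sqrt{s}}{n\kappa^l_{c_0}}\|f_i'\delta_l\|_{2,n} + 2c_s\|f_i'\delta_l\|_{2,n}$ and the result follows.

To establish the second statement of the Lemma, we consider two cases. First, assume $\|\hat\Upsilon_l^0\delta_{lT_l^c}\|_1 \le 2c_0\|\hat\Upsilon_l^0\delta_{lT_l}\|_1$. In this case, by definition of $\kappa^l_{2c_0}$, we have

$$\|\hat\Upsilon_l^0\delta_l\|_1 \le (1 + 2c_0)\|\hat\Upsilon_l^0\delta_{lT_l}\|_1 \le (1 + 2c_0)\sqrt{s}\,\|f_i'\delta_l\|_{2,n}/\kappa^l_{2c_0}$$

and the result follows by applying the first bound to $\|f_i'\delta_l\|_{2,n}$. On the other hand, consider the case that

$$\|\hat\Upsilon_l^0\delta_{lT_l^c}\|_1 > 2c_0\|\hat\Upsilon_l^0\delta_{lT_l}\|_1 \tag{C.25}$$

which would already imply $\|f_i'\delta_l\|_{2,n} \le 2c_s$ by (C.24). Moreover,

$$\|\hat\Upsilon_l^0\delta_{lT_l^c}\|_1 \le_{(1)} c_0\|\hat\Upsilon_l^0\delta_{lT_l}\|_1 + \frac{c}{\ell c - 1}\frac{n}{\lambda}\|f_i'\delta_l\|_{2,n}\left(2c_s - \|f_i'\delta_l\|_{2,n}\right)$$
$$\le_{(2)} c_0\|\hat\Upsilon_l^0\delta_{lT_l}\|_1 + \frac{c}{\ell c - 1}\frac{n}{\lambda}c_s^2 \le_{(3)} \frac12\|\hat\Upsilon_l^0\delta_{lT_l^c}\|_1 + \frac{c}{\ell c - 1}\frac{n}{\lambda}c_s^2,$$

where (1) holds by (C.24), (2) holds since $\|f_i'\delta_l\|_{2,n}(2c_s - \|f_i'\delta_l\|_{2,n}) \le \max_{x\ge 0}x(2c_s - x) \le c_s^2$, and (3) follows from (C.25). Thus

$$\|\hat\Upsilon_l^0\delta_l\|_1 \le \left(1 + \frac{1}{2c_0}\right)\|\hat\Upsilon_l^0\delta_{lT_l^c}\|_1 \le \left(1 + \frac{1}{2c_0}\right)\frac{2c}{\ell c - 1}\frac{n}{\lambda}c_s^2,$$

and the result follows from noting that $c/(\ell c - 1) \le c_0/u \le c_0$ and $1 + 1/(2c_0) \le 3/2$. □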



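Step 2. In this step we prove a lemma about the quantiles of the maximum of the scores $S_l = 2\mathbb{E}_n[(\hat\Upsilon_l^0)^{-1}f_i v_{il}]$, and use it to pin down the level of the penalty. For $\lambda = c\,2\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2k_e p))$, we have that as $\gamma \to 0$ and $n \to \infty$, $P\left(c\max_{1\le l\le k_e}\|S_l\|_\infty > \lambda/n\right) = o(1)$, provided that for some $b_n \to \infty$

$$2\Phi^{-1}(1 - \gamma/(2k_e p)) \le \frac{n^{1/6}}{b_n}\min_{1\le j\le p,\ 1\le l\le k_e}M_{jl}, \qquad M_{jl} := \frac{\bar{\mathrm{E}}[f_{ij}^2 v_{il}^2]^{1/2}}{\bar{\mathrm{E}}[|f_{ij}|^3|v_{il}|^3]^{1/3}}.$$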






Note that the last condition is satisfied under our conditions for large $n$ for some $b_n \to \infty$, since $k_e$ is fixed, $\log(1/\gamma) \lesssim \log(p \vee n)$, $K_n^2\log^3(p \vee n) = o(n)$, and $\min_{1\le j\le p,\ 1\le l\le k_e}M_{jl} \gtrsim 1/K_n^{1/3}$. This result follows from the bounds on moderate deviations of a maximum of a vector provided in Lemma 5, by $\bar\Phi(t) \le \phi(t)/t$, $\max_{1\le j\le p,\ 1\le l\le k_e}1/M_{jl} \lesssim K_n^{1/3}$, and $K_n^{2/3}\log(p \vee n) = o(n^{1/3})$ holding by Condition RF.
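The order of magnitude of the penalty level can be checked numerically. The sketch below compares the Gaussian quantile entering $\lambda$ with the logarithmic upper bound $\sqrt{2\log(k_e p/\gamma)}$, which follows from the standard Gaussian tail bound $1 - \Phi(t) \le \tfrac12\exp(-t^2/2)$ for $t \ge 0$ (a slightly different bound from the one quoted in Step 2, used here only because it gives a closed form). Function names are hypothetical.

```python
import math
from statistics import NormalDist


def normal_quantile_level(p, gamma, k_e=1):
    """Phi^{-1}(1 - gamma / (2 k_e p)), the quantile that enters lambda."""
    return NormalDist().inv_cdf(1.0 - gamma / (2.0 * k_e * p))


def log_upper_bound(p, gamma, k_e=1):
    """sqrt(2 log(k_e p / gamma)), implied by 1 - Phi(t) <= exp(-t^2 / 2) / 2."""
    return math.sqrt(2.0 * math.log(k_e * p / gamma))


def quantile_table(ps, gamma=0.05):
    """Rows (p, quantile, bound), to show how slowly the quantile grows with p."""
    return [(p, normal_quantile_level(p, gamma), log_upper_bound(p, gamma)) for p in ps]
```

Multiplying either column by $2c\sqrt{n}$ gives the penalty level and its bound of order $\sqrt{n\log(p k_e/\gamma)}$.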

Step 3. The main result of this step is the following: Define the expected "ideal" penalty loadings $\Upsilon_l^0 := \mathrm{diag}\left(\sqrt{\bar{\mathrm{E}}[f_{i1}^2 v_{il}^2]},\dots,\sqrt{\bar{\mathrm{E}}[f_{ip}^2 v_{il}^2]}\right)$, where the entries of $\Upsilon_l^0$ are bounded away from zero and from above uniformly in $n$ by Condition RF. Then the empirical "ideal" loadings converge to the expected "ideal" loadings: $\max_{1\le l\le k_e}\|\hat\Upsilon_l^0 - \Upsilon_l^0\|_\infty \to_P 0$. This is assumed in Condition RF.

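Step 4. Combining the results of all the steps above, given that $\lambda = 2c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2pk_e)) \lesssim c\sqrt{n\log(pk_e/\gamma)}$, $k_e$ fixed, and asymptotically valid penalty loadings $\hat\Upsilon_l$, and using the bound $c_s \lesssim_P \sqrt{s/n}$ from Condition AS, we obtain the conclusion that

$$\|f_i'(\hat\beta_l - \beta_{l0})\|_{2,n} \lesssim_P \frac{1}{\kappa^l_{c_0}}\sqrt{\frac{s\log(k_e p/\gamma)}{n}} + \sqrt{\frac{s}{n}},$$

which gives, by the triangular inequality and by $\|D_{il} - f_i'\beta_{l0}\|_{2,n} \le c_s \lesssim_P \sqrt{s/n}$ holding by Condition AS,

$$\|\hat D_{il} - D_{il}\|_{2,n} \lesssim_P \frac{1}{\kappa^l_{c_0}}\sqrt{\frac{s\log(k_e p/\gamma)}{n}}.$$

The first result follows since $\kappa_{\bar C} \lesssim_P \kappa^l_{c_0}$ by Step 1.

To derive the $\ell_1$-rate we apply the second result in Lemma 6 as follows

$$\|\hat\beta_l - \beta_{l0}\|_1 \le \|(\hat\Upsilon_l^0)^{-1}\|_\infty\|\hat\Upsilon_l^0(\hat\beta_l - \beta_{l0})\|_1$$
$$\lesssim_P \|(\hat\Upsilon_l^0)^{-1}\|_\infty\left(\frac{\sqrt{s}}{\kappa^l_{2c_0}}\left(\frac{1}{\kappa^l_{c_0}}\sqrt{\frac{s\log(p/\gamma)}{n}} + \sqrt{\frac{s}{n}}\right) + \frac{1}{\sqrt{\log(p/\gamma)}}\frac{s}{\sqrt{n}}\right) \lesssim_P \frac{1}{(\kappa^l_{2c_0})^2}\sqrt{\frac{s^2\log(p/\gamma)}{n}}.$$

That yields the result since $\kappa_{2\bar C} \lesssim_P \kappa^l_{2c_0}$ by Step 1. □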



Appendix D. Proof of Theorem 2 

The proof proceeds in three steps. The general strategy of Step 1 follows Belloni and Chernozhukov (2011a) and Belloni and Chernozhukov (2012), but a major difference is the use of moderate deviation theory for self-normalized sums, which allows us to obtain the results for non-Gaussian and heteroscedastic errors as well as handle data-driven penalty loadings. The sparsity proofs are motivated by Belloni and Chernozhukov (2012) but adjusted for the data-driven penalty loadings that contain self-normalizing factors.

Step 1. Here we derive a general performance bound for Post-Lasso that actually contains more information than the statement of the theorem. This lemma will be invoked in Step 3 below.

Let $F = [f_1;\dots;f_n]'$ denote an $n$ by $p$ matrix, and for a set of indices $S \subset \{1,\dots,p\}$ let $\mathcal{P}_S = F[S](F[S]'F[S])^{-1}F[S]'$ denote the projection matrix on the columns associated with the indices in $S$.

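Lemma 7 (Performance of the Post-Lasso). Under Conditions AS and RF, let $\hat T_l$ denote the support selected by $\hat\beta_l = \hat\beta_{lL}$, $\hat T_l \subseteq \hat I_l$, $\hat m_l = |\hat I_l \setminus T_l|$, and $\hat\beta_{lPL}$ be the Post-Lasso estimator based on $\hat I_l$, $l = 1,\dots,k_e$. Then we have

$$\max_{l\le k_e}\|D_{il} - f_i'\hat\beta_{lPL}\|_{2,n} \lesssim_P \sqrt{\frac{s}{\phi_{\min}(s)}}\sqrt{\frac{k_e \wedge \log(sk_e)}{n}} + \max_{l\le k_e}\sqrt{\frac{\hat m_l\log(pk_e)}{n\,\phi_{\min}(\hat m_l)}} + \max_{l\le k_e}\|(D_l - \mathcal{P}_{\hat I_l}D_l)/\sqrt{n}\|_2,$$

$$\max_{l\le k_e}\|\hat\Upsilon_l(\hat\beta_{lPL} - \beta_{l0})\|_1 \le \max_{l\le k_e}\frac{(\|\hat\Upsilon_l^0\|_\infty + \|\hat\Upsilon_l - \hat\Upsilon_l^0\|_\infty)\sqrt{\hat m_l + s}}{\sqrt{\phi_{\min}(\hat m_l + s)}}\|f_i'(\hat\beta_{lPL} - \beta_{l0})\|_{2,n}.$$

If in addition $\lambda/n \ge c\|S_l\|_\infty$, and $\hat\Upsilon_l$ satisfies (3.13) with $u \ge 1 \ge \ell > 1/c$ in the first stage for Lasso for every $l = 1,\dots,k_e$, then we have

$$\max_{l\le k_e}\|(D_l - \mathcal{P}_{\hat I_l}D_l)/\sqrt{n}\|_2 \le \max_{l\le k_e}\left(u + \frac1c\right)\frac{\lambda\sqrt{s}}{n\kappa^l_{c_0}} + 3c_s.$$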

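Proof of Lemma 7. We have that $D_l - F\hat\beta_{lPL} = (I - \mathcal{P}_{\hat I_l})D_l - \mathcal{P}_{\hat I_l}v_l$, where $I$ is the identity operator. Therefore for every $l = 1,\dots,k_e$ we have

$$\|D_l - F\hat\beta_{lPL}\|_2 \le \|(I - \mathcal{P}_{\hat I_l})D_l\|_2 + \|\mathcal{P}_{T_l}v_l\|_2 + \|\mathcal{P}_{\hat I_l\setminus T_l}v_l\|_2. \tag{D.26}$$

Since $\|F[\hat I_l\setminus T_l]/\sqrt{n}\,(F[\hat I_l\setminus T_l]'F[\hat I_l\setminus T_l]/n)^{-1}\| \le \sqrt{1/\phi_{\min}(\hat m_l)}$, $\hat m_l = |\hat I_l\setminus T_l|$, the last term in (D.26) satisfies

$$\|\mathcal{P}_{\hat I_l\setminus T_l}v_l\|_2 \le \sqrt{1/\phi_{\min}(\hat m_l)}\,\|F[\hat I_l\setminus T_l]'v_l/\sqrt{n}\|_2 \le \sqrt{\hat m_l/\phi_{\min}(\hat m_l)}\,\|F'v_l/\sqrt{n}\|_\infty.$$

Under Condition RF, by Lemma 5 we have

$$\max_{l=1,\dots,k_e}\|F'v_l/\sqrt{n}\|_\infty \lesssim_P \sqrt{\log(pk_e)}\max_{l\le k_e,\ j\le p}\sqrt{\mathbb{E}_n[f_{ij}^2 v_{il}^2]}.$$

Note that Condition RF also implies $\max_{l\le k_e,\ j\le p}\sqrt{\mathbb{E}_n[f_{ij}^2 v_{il}^2]} \lesssim_P 1$ since $\max_{l\le k_e,\ j\le p}|(\mathbb{E}_n - \bar{\mathrm{E}})[f_{ij}^2 v_{il}^2]| \to_P 0$ and $\max_{l\le k_e,\ j\le p}\bar{\mathrm{E}}[f_{ij}^2 v_{il}^2] \lesssim 1$.

We bound the second term in (D.26) in two ways. First, proceeding as above we have

$$\max_{l=1,\dots,k_e}\|\mathcal{P}_{T_l}v_l\|_2 \lesssim_P \sqrt{\log(k_e s)}\sqrt{s/\phi_{\min}(s)}\max_{l\le k_e,\ j\le p}\sqrt{\mathbb{E}_n[f_{ij}^2 v_{il}^2]}.$$

Second, since $\mathrm{E}[\|F[T_l]'v_l\|_2^2] = \mathrm{E}\left[\sum_{j\in T_l}\left(\sum_{i=1}^n f_{ij}v_{il}\right)^2\right] = \sum_{j\in T_l}\sum_{i=1}^n\mathrm{E}[f_{ij}^2 v_{il}^2]$, we have

$$\max_{l=1,\dots,k_e}\|\mathcal{P}_{T_l}v_l\|_2 \lesssim_P \sqrt{sk_e/\phi_{\min}(s)}\max_{l\le k_e,\ j\le p}\sqrt{\bar{\mathrm{E}}[f_{ij}^2 v_{il}^2]}.$$

These relations yield the first result.

Letting $\delta_l = \hat\beta_{lPL} - \beta_{l0}$, the statement regarding the $\ell_1$-norm of the theorem follows from

$$\|\hat\Upsilon_l\delta_l\|_1 \le \|\hat\Upsilon_l\|_\infty\|\delta_l\|_1 \le \|\hat\Upsilon_l\|_\infty\sqrt{\|\delta_l\|_0}\,\|\delta_l\|_2 \le \|\hat\Upsilon_l\|_\infty\sqrt{\|\delta_l\|_0}\,\|f_i'\delta_l\|_{2,n}/\sqrt{\phi_{\min}(\|\delta_l\|_0)},$$

and noting that $\|\delta_l\|_0 \le \hat m_l + s$ and $\|\hat\Upsilon_l\|_\infty \le \|\hat\Upsilon_l^0\|_\infty + \|\hat\Upsilon_l - \hat\Upsilon_l^0\|_\infty$.

The last statement follows from noting that the Lasso solution provides an upper bound to the approximation of the best model based on $\hat I_l$, since $\hat T_l \subseteq \hat I_l$, and the application of Lemma 6. □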

Comment D.1 (Comparison between Lasso and Post-Lasso performance). Under mild conditions on the empirical Gram matrix and on the number of additional variables, Lemma 10 below derives sparsity bounds on the model selected by Lasso, which establishes that

$$|\hat T_l \setminus T_l| = \hat m_l \lesssim_P s.$$

Under this condition, we have that the rate of Post-Lasso is no worse than Lasso's rate. This occurs despite the fact that Lasso may in general fail to correctly select the oracle model $T_l$ as a subset, that is, $T_l \not\subseteq \hat T_l$. However, if the oracle model has well-separated coefficients and the approximation error does not dominate the estimation error, then the Post-Lasso rate improves upon Lasso's rate. Specifically, this occurs if Condition AS holds, $\hat m_l = o_P(s)$ and $T_l \subseteq \hat T_l$ wp $\to 1$, or if $T = \hat T$ wp $\to 1$ as under the conditions of Wainwright (2009). In such cases, the rates found for Lasso are sharp, and they cannot be faster than $\sqrt{s\log p/n}$. Thus, the improvement in the rate of convergence of Post-Lasso over Lasso is strict in these cases. Note that, as shown in the proof of Lemma 8, a higher penalty level will tend to reduce $\hat m_l$ but will increase the likelihood of $T_l \not\subseteq \hat T_l$. On the other hand, a lower penalty level will decrease the likelihood of $T_l \not\subseteq \hat T_l$ (bias) but will tend to increase $\hat m_l$ (variance). The impact of this trade-off on estimation is captured by the last term of the bound in Lemma 7.
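The Post-Lasso step discussed in this comment amounts to an ordinary least squares refit on the columns selected by Lasso. The following sketch illustrates it under simplifying assumptions: a single dependent variable, the selected set taken to be exactly the Lasso support (no additional variables), and a small dense solver for the normal equations. The names `solve` and `post_lasso` are hypothetical.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for k in range(m):
        piv = max(range(k, m), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, m):
            w = M[r][k] / M[k][k]
            for col in range(k, m + 1):
                M[r][col] -= w * M[k][col]
    x = [0.0] * m
    for k in reversed(range(m)):
        tail = sum(M[k][col] * x[col] for col in range(k + 1, m))
        x[k] = (M[k][m] - tail) / M[k][k]
    return x


def post_lasso(F, d, beta_lasso, tol=0.0):
    """Refit by least squares on the support of the Lasso coefficients."""
    n, p = len(F), len(F[0])
    support = [j for j in range(p) if abs(beta_lasso[j]) > tol]
    beta = [0.0] * p
    if support:
        # Normal equations restricted to the selected columns.
        G = [[sum(F[i][a] * F[i][b] for i in range(n)) for b in support] for a in support]
        g = [sum(F[i][a] * d[i] for i in range(n)) for a in support]
        for j, coef in zip(support, solve(G, g)):
            beta[j] = coef
    return beta, support
```

Because the refit removes the shrinkage applied to the selected coefficients, its fit on the selected columns is never worse in sample than the Lasso fit, which is the mechanism behind the last statement of Lemma 7.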

Step 2. In this step we provide a sparsity bound for Lasso, which is important for establishing 
various rate results and fundamental to the analysis of Post-Lasso. It relies on the following 
lemmas. 

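Lemma 8 (Empirical pre-sparsity for Lasso). Let $\hat T_l$ denote the support selected by the Lasso estimator, $\hat m_l = |\hat T_l \setminus T_l|$, and assume that $\lambda/n \ge c\|S_l\|_\infty$ and $u \ge 1 \ge \ell > 1/c$ as in Lemma 6. Then, for $c_0 = (uc + 1)/(\ell c - 1)$ we have

$$\sqrt{\hat m_l} \le \sqrt{\phi_{\max}(\hat m_l)}\,\|(\hat\Upsilon_l^0)^{-1}\|_\infty\,c_0\left[\frac{2\sqrt{s}}{\kappa^l_{c_0}} + \frac{6nc_s}{\lambda}\right].$$

Proof of Lemma 8. We have from the optimality conditions that the Lasso estimator $\hat\beta_l = \hat\beta_{lL}$ satisfies

$$2\mathbb{E}_n[\hat\Upsilon_{lj}^{-1}f_{ij}(y_i - f_i'\hat\beta_l)] = \mathrm{sign}(\hat\beta_{lj})\lambda/n \quad\text{for each } j \in \hat T_l \setminus T_l.$$

Therefore, noting that $\|\hat\Upsilon_l^{-1}\hat\Upsilon_l^0\|_\infty \le 1/\ell$, we have for $R = (a_{l1},\dots,a_{ln})'$ and $F$ denoting the $n \times p$ matrix with rows $f_i'$, $i = 1,\dots,n$,

$$\sqrt{\hat m_l}\,\lambda = 2\|(\hat\Upsilon_l^{-1}F'(Y - F\hat\beta_l))_{\hat T_l\setminus T_l}\|_2$$
$$\le 2\|(\hat\Upsilon_l^{-1}F'(Y - R - F\beta_{l0}))_{\hat T_l\setminus T_l}\|_2 + 2\|(\hat\Upsilon_l^{-1}F'R)_{\hat T_l\setminus T_l}\|_2 + 2\|(\hat\Upsilon_l^{-1}F'F(\beta_{l0} - \hat\beta_l))_{\hat T_l\setminus T_l}\|_2$$
$$\le \sqrt{\hat m_l}\,n\,\|\hat\Upsilon_l^{-1}\hat\Upsilon_l^0\|_\infty\|S_l\|_\infty + 2n\sqrt{\phi_{\max}(\hat m_l)}\,\|\hat\Upsilon_l^{-1}\|_\infty c_s + 2n\sqrt{\phi_{\max}(\hat m_l)}\,\|\hat\Upsilon_l^{-1}\|_\infty\|f_i'(\hat\beta_l - \beta_{l0})\|_{2,n}$$
$$\le \sqrt{\hat m_l}\,(1/\ell)\,n\|S_l\|_\infty + 2n\sqrt{\phi_{\max}(\hat m_l)}\,\frac{\|(\hat\Upsilon_l^0)^{-1}\|_\infty}{\ell}c_s + 2n\sqrt{\phi_{\max}(\hat m_l)}\,\frac{\|(\hat\Upsilon_l^0)^{-1}\|_\infty}{\ell}\|f_i'(\hat\beta_l - \beta_{l0})\|_{2,n},$$

where we used that

$$\|(F'F(\beta_{l0} - \hat\beta_l))_{\hat T_l\setminus T_l}\|_2 = \sup_{\|\delta\|_0\le\hat m_l,\ \|\delta\|_2\le 1}|\delta'F'F(\beta_{l0} - \hat\beta_l)| \le \sup_{\|\delta\|_0\le\hat m_l,\ \|\delta\|_2\le 1}\|\delta'F'\|_2\|F(\beta_{l0} - \hat\beta_l)\|_2$$
$$\le \sup_{\|\delta\|_0\le\hat m_l,\ \|\delta\|_2\le 1}\sqrt{|\delta'F'F\delta|}\,\|F(\beta_{l0} - \hat\beta_l)\|_2 \le n\sqrt{\phi_{\max}(\hat m_l)}\,\|f_i'(\beta_{l0} - \hat\beta_l)\|_{2,n},$$

and similarly

$$\|(F'R)_{\hat T_l\setminus T_l}\|_2 = \sup_{\|\delta\|_0\le\hat m_l,\ \|\delta\|_2\le 1}|\delta'F'R| \le \sup_{\|\delta\|_0\le\hat m_l,\ \|\delta\|_2\le 1}\|\delta'F'\|_2\|R\|_2 = \sup_{\|\delta\|_0\le\hat m_l,\ \|\delta\|_2\le 1}\sqrt{|\delta'F'F\delta|}\,\|R\|_2 \le n\sqrt{\phi_{\max}(\hat m_l)}\,c_s.$$

Since $\lambda/c \ge n\|S_l\|_\infty$, and by Lemma 6, $\|f_i'(\hat\beta_l - \beta_{l0})\|_{2,n} \le \left(u + \frac1c\right)\frac{\lambda\sqrt{s}}{n\kappa^l_{c_0}} + 2c_s$, we have

$$\left(1 - \frac{1}{c\ell}\right)\sqrt{\hat m_l} \le \frac{2\sqrt{\phi_{\max}(\hat m_l)}}{\ell}\,\|(\hat\Upsilon_l^0)^{-1}\|_\infty\left[\frac{nc_s}{\lambda} + \left(u + \frac1c\right)\frac{\sqrt{s}}{\kappa^l_{c_0}} + \frac{2nc_s}{\lambda}\right].$$

The result follows by noting that $(u + [1/c])/(1 - 1/[\ell c]) = c_0\ell$ by definition of $c_0$. □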

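Lemma 9 (Sub-linearity of maximal sparse eigenvalues). Let $M$ be a positive semi-definite matrix. For any integer $k \ge 0$ and constant $\ell \ge 1$ we have $\phi_{\max}(\lceil\ell k\rceil)(M) \le \lceil\ell\rceil\phi_{\max}(k)(M)$.

Proof. Denote by $\phi_M(k) = \phi_{\max}(k)(M)$, and let $\bar\alpha$ achieve $\phi_M(\ell k)$. Moreover let $\sum_{i=1}^{\lceil\ell\rceil}\alpha_i = \bar\alpha$ such that $\sum_{i=1}^{\lceil\ell\rceil}\|\alpha_i\|_0 = \|\bar\alpha\|_0$. We can choose the $\alpha_i$'s such that $\|\alpha_i\|_0 \le k$ since $\lceil\ell\rceil k \ge \ell k$. Since $M$ is positive semi-definite, for any $i, j$ we have $\alpha_i'M\alpha_i + \alpha_j'M\alpha_j \ge 2|\alpha_i'M\alpha_j|$. Therefore

$$\phi_M(\ell k) = \bar\alpha'M\bar\alpha = \sum_{i=1}^{\lceil\ell\rceil}\alpha_i'M\alpha_i + \sum_{i=1}^{\lceil\ell\rceil}\sum_{j\ne i}\alpha_i'M\alpha_j \le \sum_{i=1}^{\lceil\ell\rceil}\left\{\alpha_i'M\alpha_i + (\lceil\ell\rceil - 1)\alpha_i'M\alpha_i\right\}$$
$$\le \lceil\ell\rceil\sum_{i=1}^{\lceil\ell\rceil}\|\alpha_i\|_2^2\,\phi_M(\|\alpha_i\|_0) \le \lceil\ell\rceil\max_{i=1,\dots,\lceil\ell\rceil}\phi_M(\|\alpha_i\|_0) \le \lceil\ell\rceil\phi_M(k),$$

where we used that $\sum_{i=1}^{\lceil\ell\rceil}\|\alpha_i\|_2^2 = 1$. □

Lemma 10 (Sparsity bound for Lasso under data-driven penalty). Consider the Lasso estimator $\hat\beta_l = \hat\beta_{lL}$ with $\lambda/n \ge c\|S_l\|_\infty$, and let $\hat m_l = |\hat T_l \setminus T_l|$. Consider the set

$$\mathcal{M} = \left\{m \in \mathbb{N} : m > s\,2\phi_{\max}(m)\,\|(\hat\Upsilon_l^0)^{-1}\|_\infty^2\left[\frac{2c_0}{\kappa^l_{c_0}} + \frac{6c_0 nc_s}{\lambda\sqrt{s}}\right]^2\right\}.$$

Then,

$$\hat m_l \le s\left(\min_{m\in\mathcal{M}}\phi_{\max}(m \wedge n)\right)\|(\hat\Upsilon_l^0)^{-1}\|_\infty^2\left(\frac{2c_0}{\kappa^l_{c_0}} + \frac{6c_0 nc_s}{\lambda\sqrt{s}}\right)^2.$$

Comment D.2 (Sparsity Bound). Provided that the regularization event $\lambda/n \ge c\|S_l\|_\infty$ occurs, Lemma 10 bounds the number of components $\hat m_l$ incorrectly selected by Lasso. Essentially, the bound depends on $s$ and on the ratio between the maximum sparse eigenvalues and the restricted eigenvalues. Thus, the empirical Gram matrix can impact the sparsity bound substantially. However, under Condition SE, the ratio mentioned is bounded from above uniformly in $n$. As expected, the bound improves and the regularization event is more likely to occur if a larger value of the penalty parameter $\lambda$ is used.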

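Proof of Lemma 10. Rewriting the conclusion in Lemma 8 we have

$$\hat m_l \le s\,\phi_{\max}(\hat m_l)\,\|(\hat\Upsilon_l^0)^{-1}\|_\infty^2\left[\frac{2c_0}{\kappa^l_{c_0}} + \frac{6c_0 nc_s}{\lambda\sqrt{s}}\right]^2. \tag{D.27}$$

Note that $\hat m_l \le n$ by optimality conditions. Consider any $M \in \mathcal{M}$, and suppose $\hat m_l > M$. Therefore by Lemma 9 on sublinearity of sparse eigenvalues

$$\hat m_l \le s\left\lceil\frac{\hat m_l}{M}\right\rceil\phi_{\max}(M)\,\|(\hat\Upsilon_l^0)^{-1}\|_\infty^2\left[\frac{2c_0}{\kappa^l_{c_0}} + \frac{6c_0 nc_s}{\lambda\sqrt{s}}\right]^2.$$

Thus, since $\lceil k\rceil \le 2k$ for any $k \ge 1$ we have

$$M \le s\,2\phi_{\max}(M)\,\|(\hat\Upsilon_l^0)^{-1}\|_\infty^2\left[\frac{2c_0}{\kappa^l_{c_0}} + \frac{6c_0 nc_s}{\lambda\sqrt{s}}\right]^2$$

which violates the condition that $M \in \mathcal{M}$. Therefore, we have $\hat m_l \le M$.

In turn, applying (D.27) once more with $\hat m_l \le (M \wedge n)$ we obtain

$$\hat m_l \le s\,\phi_{\max}(M \wedge n)\,\|(\hat\Upsilon_l^0)^{-1}\|_\infty^2\left[\frac{2c_0}{\kappa^l_{c_0}} + \frac{6c_0 nc_s}{\lambda\sqrt{s}}\right]^2.$$

The result follows by minimizing the bound over $M \in \mathcal{M}$. □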
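The sublinearity property of Lemma 9, which drives the argument above, can be checked numerically on small matrices. The sketch below computes the maximal $k$-sparse eigenvalue by brute force over index sets, using power iteration for the largest eigenvalue of each principal submatrix. It is intended only for illustration with small dimensions; the function names are hypothetical, and the power iteration assumes a positive semi-definite input whose top eigenvector is not orthogonal to the fixed start vector.

```python
import math
from itertools import combinations


def top_eigenvalue(A, iters=500):
    """Largest eigenvalue of a symmetric positive semi-definite matrix by power iteration."""
    m = len(A)
    x = [1.0 + 0.1 * i for i in range(m)]  # generic start vector
    lam = 0.0
    for _ in range(iters):
        y = [sum(A[r][c] * x[c] for c in range(m)) for r in range(m)]
        norm = math.sqrt(sum(v * v for v in y))
        if norm == 0.0:
            return 0.0
        x = [v / norm for v in y]
        # Rayleigh quotient x'Ax at the normalized iterate.
        lam = sum(x[r] * sum(A[r][c] * x[c] for c in range(m)) for r in range(m))
    return lam


def phi_max(M, k):
    """Maximal k-sparse eigenvalue: largest top eigenvalue over k-by-k principal submatrices."""
    p = len(M)
    k = min(k, p)
    if k == 0:
        return 0.0
    return max(
        top_eigenvalue([[M[a][b] for b in S] for a in S])
        for S in combinations(range(p), k)
    )


def sublinearity_holds(M, k, ell, slack=1e-9):
    """Check phi_max(ceil(ell * k)) <= ceil(ell) * phi_max(k) for one (k, ell) pair."""
    return phi_max(M, math.ceil(ell * k)) <= math.ceil(ell) * phi_max(M, k) + slack
```

For example, for the tridiagonal matrix with 2 on the diagonal and 1 on the off-diagonals, `sublinearity_holds(M, 2, 1.5)` compares the 3-sparse eigenvalue with twice the 2-sparse eigenvalue.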



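Step 3. Next we combine the previous steps to establish Theorem 2. As in Step 3 of Appendix C, recall that $\max_{1\le l\le k_e}\|\hat\Upsilon_l^0 - \Upsilon_l^0\|_\infty \to_P 0$.

Let $\bar k$ be the integer that achieves the minimum in the definition of $\mu^2$. Since $c_s \lesssim_P \sqrt{s/n}$ leads to $nc_s/[\lambda\sqrt{s}] \to_P 0$, we have that $\bar k \in \mathcal{M}$ with high probability as $n \to \infty$. Moreover, as long as $\lambda/n \ge c\max_{1\le l\le k_e}\|S_l\|_\infty$, $\ell \to_P 1$ and $c > 1$, by Lemma 10 we have for every $l = 1,\dots,k_e$ that

$$\hat m_l \lesssim_P s\mu^2\phi_{\min}(\bar k + s)/\kappa_{\bar C}^2 \lesssim_P s\mu^2\phi_{\min}(\hat m_l + s)/\kappa_{\bar C}^2 \tag{D.28}$$

since $\bar k \in \mathcal{M}$ implies $\bar k \ge \hat m_l$.

By the choice of $\lambda = 2c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2pk_e))$ in (2.7), since $\gamma \to 0$, the event $\lambda/n \ge c\max_{1\le l\le k_e}\|S_l\|_\infty$ holds with probability approaching 1. Therefore, by the first and last results in Lemma 7 we have

$$\max_{1\le l\le k_e}\|D_{il} - f_i'\hat\beta_{lPL}\|_{2,n} \lesssim_P \frac{\mu}{\kappa_{\bar C}}\sqrt{\frac{s\log p}{n}} + c_s + \max_{1\le l\le k_e}\frac{\lambda\sqrt{s}}{n\kappa^l_{c_0}}.$$

Because $\max_{1\le l\le k_e}1/\kappa^l_{c_0} \le \max_{1\le l\le k_e}\|\hat\Upsilon_l^0\|_\infty/\kappa_{\bar C}$ by Step 1 of Theorem 1, we have

$$\max_{1\le l\le k_e}\|D_{il} - f_i'\hat\beta_{lPL}\|_{2,n} \lesssim_P \frac{\mu}{\kappa_{\bar C}}\sqrt{\frac{s\log(k_e p/\gamma)}{n}} \tag{D.29}$$

since $k_e \lesssim p$ and $c_s \lesssim_P \sqrt{s/n}$. That establishes the first inequality of Theorem 2.

To establish the second inequality of Theorem 2, since $\|\hat\beta_{lPL} - \beta_{l0}\|_0 \le \hat m_l + s$, we have

$$\|\hat\beta_{lPL} - \beta_{l0}\|_1 \le \sqrt{\|\hat\beta_{lPL} - \beta_{l0}\|_0}\,\|\hat\beta_{lPL} - \beta_{l0}\|_2 \le \sqrt{\hat m_l + s}\,\frac{\|f_i'(\hat\beta_{lPL} - \beta_{l0})\|_{2,n}}{\sqrt{\phi_{\min}(\hat m_l + s)}}.$$

The sparsity bound (D.28), the prediction norm bound (D.29), and the relation $\|D_{il} - f_i'\hat\beta_{lPL}\|_{2,n} \le c_s + \|f_i'(\hat\beta_{lPL} - \beta_{l0})\|_{2,n}$ yield the result with the relation above. □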

Lemma 11 (Asymptotic Validity of the Data-Driven Penalty Loadings). Under the conditions of Theorem 1 and Condition RF or the conditions of Theorem 2 and Condition SE, the penalty loadings $\hat\Upsilon$ constructed by the $K$-step Algorithm A.1 are asymptotically valid. In particular, for $K \ge 2$ we have $u' = 1$.



For proof of Lemma 11 see Online Appendix. 



Appendix E. Proofs of Lemmas 1-4 

For proof of Lemma 1, see Belloni and Chernozhukov (2011a), Supplement. For proof of Lemma 2, see Belloni and Chernozhukov (2012). For proofs of Lemmas 3 and 4, see Online Appendix.



Appendix F. Proofs of Theorems 3-7. 

F.1. Proof of Theorems 3 and 4. The proofs are original and they rely on the consistency of the sparsity-based estimators both with respect to the $L^2(\mathbb{P}_n)$ norm $\|\cdot\|_{2,n}$ and the $\ell_1$-norm $\|\cdot\|_1$. These proofs also exploit the use of moderate deviation theory for self-normalized sums.

Step 0. Using a data-driven penalty satisfying (2.7) and (3.13), we have by Theorem 1 and Condition RE that the Lasso estimator, and by Theorem 2 and Condition SE that the Post-Lasso estimator, obey:

$$\max_{1\le l\le k_e}\|\hat D_{il} - D_{il}\|_{2,n} \lesssim_P \sqrt{\frac{s\log(p \vee n)}{n}} \to 0 \tag{F.30}$$

$$\sqrt{\log p}\,\max_{1\le l\le k_e}\|\hat\beta_l - \beta_{l0}\|_1 \lesssim_P \sqrt{\frac{s^2\log^2(p \vee n)}{n}} \to 0. \tag{F.31}$$

In order to prove Theorem 3 we need also the condition

$$\max_{1\le l\le k_e}\|\hat D_{il} - D_{il}\|_{2,n}^2\,n^{2/q_\epsilon} \lesssim_P \frac{s\log(p \vee n)}{n}\,n^{2/q_\epsilon} \to 0,$$

with the last statement holding by Condition SM. Note that Theorem 4 assumes (F.30)-(F.31) as high level conditions.
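Step 1. We have that by $\mathrm{E}[\epsilon_i | D_i] = 0$

$$\sqrt{n}(\hat\alpha - \alpha_0) = \mathbb{E}_n[\hat D_i d_i']^{-1}\sqrt{n}\,\mathbb{E}_n[\hat D_i\epsilon_i] = \{\mathbb{E}_n[\hat D_i d_i']\}^{-1}\left(\mathbb{G}_n[D_i\epsilon_i] + o_P(1)\right)$$
$$= \{\bar{\mathrm{E}}[D_i d_i'] + o_P(1)\}^{-1}\left(\mathbb{G}_n[D_i\epsilon_i] + o_P(1)\right),$$

where by Steps 2 and 3 below:

$$\mathbb{E}_n[\hat D_i d_i'] = \bar{\mathrm{E}}[D_i d_i'] + o_P(1) \tag{F.32}$$
$$\sqrt{n}\,\mathbb{E}_n[\hat D_i\epsilon_i] = \mathbb{G}_n[D_i\epsilon_i] + o_P(1) \tag{F.33}$$

where $\bar{\mathrm{E}}[D_i d_i'] = \bar{\mathrm{E}}[D_i D_i'] = Q$ is bounded away from zero and bounded from above in the matrix sense, uniformly in $n$. Moreover, $\mathrm{Var}(\mathbb{G}_n[D_i\epsilon_i]) = \Omega$ where $\Omega = \sigma^2\bar{\mathrm{E}}[D_i D_i']$ under homoscedasticity and $\Omega = \bar{\mathrm{E}}[\epsilon_i^2 D_i D_i']$ under heteroscedasticity. In either case we have that $\Omega$ is bounded away from zero and from above in the matrix sense, uniformly in $n$, by the assumptions of the theorems. (Note that matrices $\Omega$ and $Q$ are implicitly indexed by $n$, but we omit the index to simplify notation.) Therefore,

$$\sqrt{n}(\hat\alpha - \alpha_0) = Q^{-1}\mathbb{G}_n[D_i\epsilon_i] + o_P(1),$$

and $Z_n = (Q^{-1}\Omega Q^{-1})^{-1/2}\sqrt{n}(\hat\alpha - \alpha_0) = \mathbb{G}_n[z_{i,n}] + o_P(1)$, where $z_{i,n} = (Q^{-1}\Omega Q^{-1})^{-1/2}Q^{-1}D_i\epsilon_i$ are i.n.i.d. with mean zero and variance $I$. We have that for some small enough $\delta > 0$

$$\bar{\mathrm{E}}\|z_{i,n}\|_2^{2+\delta} \lesssim \bar{\mathrm{E}}\left[\|D_i\|_2^{2+\delta}|\epsilon_i|^{2+\delta}\right] \lesssim 1,$$

by Condition SM. This condition verifies the Lyapunov condition, and the application of the Lyapunov CLT for i.n.i.d. triangular arrays and the Cramér-Wold device implies that $Z_n \to_d N(0, I)$.

Step 2. To show (F.32), note that

$$\|\mathbb{E}_n[(\hat D_i - D_i)d_i']\| \le \mathbb{E}_n[\|\hat D_i - D_i\|_2\|d_i\|_2] \le \sqrt{\mathbb{E}_n[\|\hat D_i - D_i\|_2^2]\,\mathbb{E}_n[\|d_i\|_2^2]}$$
$$= \sqrt{\mathbb{E}_n\Big[\sum_{l=1}^{k_e}|\hat D_{il} - D_{il}|^2\Big]\mathbb{E}_n[\|d_i\|_2^2]} \le \sqrt{k_e}\max_{1\le l\le k_e}\|\hat D_{il} - D_{il}\|_{2,n}\sqrt{\mathbb{E}_n[\|d_i\|_2^2]}$$
$$\lesssim_P \max_{1\le l\le k_e}\|\hat D_{il} - D_{il}\|_{2,n} = o_P(1),$$

where $\sqrt{\mathbb{E}_n[\|d_i\|_2^2]} \lesssim_P 1$ by $\bar{\mathrm{E}}\|d_i\|_2^2 \lesssim 1$ and Chebyshev, and the last assertion holds by Step 0.

Moreover, $\mathbb{E}_n[D_i D_i'] - \bar{\mathrm{E}}[D_i D_i'] \to_P 0$ by the von Bahr-Esseen inequality (von Bahr and Esseen, 1965) using that $\bar{\mathrm{E}}[\|D_i\|_2^q]$ for a fixed $q > 2$ is bounded uniformly in $n$ by Condition SM.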



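Step 3. To show (F.33), let $a_{il} := a_l(x_i)$, note that $\mathrm{E}[f_{ij}\epsilon_i] = 0$, $\mathrm{E}[\epsilon_i | D_{il}] = 0$ and $\mathrm{E}[\epsilon_i | a_{il}] = 0$, and

$$\max_{1\le l\le k_e}\left|\sqrt{n}\,\mathbb{E}_n[(\hat D_{il} - D_{il})\epsilon_i]\right| = \max_{1\le l\le k_e}\left|\sqrt{n}\,\mathbb{E}_n\{f_i'(\hat\beta_l - \beta_{l0})\epsilon_i\} - \mathbb{G}_n\{a_{il}\epsilon_i\}\right|$$
$$= \max_{1\le l\le k_e}\Big|\sum_{j=1}^p\mathbb{G}_n\{f_{ij}\epsilon_i\}(\hat\beta_{lj} - \beta_{l0j}) - \mathbb{G}_n\{a_{il}\epsilon_i\}\Big|$$
$$\le \max_{1\le j\le p}\left|\frac{\mathbb{G}_n[f_{ij}\epsilon_i]}{\sqrt{\mathbb{E}_n[f_{ij}^2\epsilon_i^2]}}\right|\max_{1\le j\le p}\sqrt{\mathbb{E}_n[f_{ij}^2\epsilon_i^2]}\max_{1\le l\le k_e}\|\hat\beta_l - \beta_{l0}\|_1 + \max_{1\le l\le k_e}|\mathbb{G}_n\{a_{il}\epsilon_i\}|.$$

Next we note that for each $l = 1,\dots,k_e$, $|\mathbb{G}_n\{a_{il}\epsilon_i\}| \lesssim_P [\mathbb{E}_n a_{il}^2]^{1/2} \lesssim_P \sqrt{s/n} \to 0$, by Condition AS on $[\mathbb{E}_n a_{il}^2]^{1/2}$ and by Chebyshev inequality, since in the homoscedastic case of Theorem 3 $\mathrm{Var}[\mathbb{G}_n\{a_{il}\epsilon_i\} | x_1,\dots,x_n] \le \sigma^2\mathbb{E}_n a_{il}^2$, and in the bounded heteroscedastic case of Theorem 3 $\mathrm{Var}[\mathbb{G}_n\{a_{il}\epsilon_i\} | x_1,\dots,x_n] \lesssim \mathbb{E}_n a_{il}^2$. Next we can bound $\max_{1\le j\le p}\left|\mathbb{G}_n[f_{ij}\epsilon_i]/\sqrt{\mathbb{E}_n[f_{ij}^2\epsilon_i^2]}\right| \lesssim_P \sqrt{\log p}$ provided that $p$ obeys the growth condition $\log p = o(n^{1/3})$, and

$$\min_{1\le j\le p}M_{j0} \gtrsim 1, \qquad M_{j0} := \frac{\bar{\mathrm{E}}[f_{ij}^2\epsilon_i^2]^{1/2}}{\bar{\mathrm{E}}[|f_{ij}|^3|\epsilon_i|^3]^{1/3}}. \tag{F.34}$$

This result follows by the bound on moderate deviations of a maximum of a self-normalized vector stated in Lemma 5, and by (F.34) holding by Condition SM. Finally, $\max_{1\le j\le p}\mathbb{E}_n[f_{ij}^2\epsilon_i^2] \lesssim_P 1$ by Condition SM. Thus, combining the bounds above with the bounds in (F.30)-(F.31),

$$\max_{1\le l\le k_e}\left|\sqrt{n}\,\mathbb{E}_n[(\hat D_{il} - D_{il})\epsilon_i]\right| \lesssim_P \sqrt{\frac{s^2\log^2(p \vee n)}{n}} + \sqrt{\frac{s}{n}} \to 0,$$

where the conclusion follows by Condition SM (iii).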



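Step 4. This step establishes consistency of the variance estimator in the homoscedastic case of Theorem 3.

Since $\sigma^2$ and $Q = \bar{\mathrm{E}}[D_i D_i']$ are bounded away from zero and from above uniformly in $n$, it suffices to show $\hat\sigma^2 - \sigma^2 \to_P 0$ and $\mathbb{E}_n[\hat D_i\hat D_i'] - \bar{\mathrm{E}}[D_i D_i'] \to_P 0$. Indeed, $\hat\sigma^2 = \mathbb{E}_n[(\epsilon_i - d_i'(\hat\alpha - \alpha_0))^2] = \mathbb{E}_n[\epsilon_i^2] + 2\mathbb{E}_n[\epsilon_i d_i'(\alpha_0 - \hat\alpha)] + \mathbb{E}_n[(d_i'(\alpha_0 - \hat\alpha))^2]$, so that $\mathbb{E}_n[\epsilon_i^2] - \sigma^2 \to_P 0$ by Chebyshev inequality since $\bar{\mathrm{E}}[|\epsilon_i|^4]$ is bounded uniformly in $n$, and the remaining terms converge to zero in probability since $\hat\alpha - \alpha_0 \to_P 0$ by Step 3, $\|\mathbb{E}_n[d_i\epsilon_i]\|_2 \lesssim_P 1$ by Markov and since $\bar{\mathrm{E}}\|d_i\epsilon_i\|_2 \le \sqrt{\bar{\mathrm{E}}\|d_i\|_2^2}\sqrt{\bar{\mathrm{E}}|\epsilon_i|^2}$ is uniformly bounded in $n$ by Condition SM, and $\mathbb{E}_n\|d_i\|_2^2 \lesssim_P 1$ by Markov and $\bar{\mathrm{E}}\|d_i\|_2^2$ bounded uniformly in $n$ by Condition SM. Next, note that

$$\|\mathbb{E}_n[\hat D_i\hat D_i'] - \mathbb{E}_n[D_i D_i']\| = \|\mathbb{E}_n[D_i(\hat D_i - D_i)' + (\hat D_i - D_i)D_i'] + \mathbb{E}_n[(\hat D_i - D_i)(\hat D_i - D_i)']\|$$

which is bounded up to a constant by

$$\sqrt{k_e}\max_{1\le l\le k_e}\|\hat D_{il} - D_{il}\|_{2,n}\,\|\|D_i\|_2\|_{2,n} + k_e\max_{1\le l\le k_e}\|\hat D_{il} - D_{il}\|_{2,n}^2 \to_P 0$$

by (F.30) and by $\|\|D_i\|_2\|_{2,n} \lesssim_P 1$ holding by Markov inequality. Moreover, $\mathbb{E}_n[D_i D_i'] - \bar{\mathrm{E}}[D_i D_i'] \to_P 0$ by Step 2.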

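Step 5. This step establishes consistency of the variance estimator in the boundedly heteroscedastic case of Theorem 3.

Recall that $\hat\Omega := \mathbb{E}_n[\hat\epsilon_i^2\hat D(x_i)\hat D(x_i)']$ and $\Omega := \bar{\mathrm{E}}[\epsilon_i^2 D(x_i)D(x_i)']$, where the latter is bounded away from zero and from above uniformly in $n$. Also, $Q = \bar{\mathrm{E}}[D_i D_i']$ is bounded away from zero and from above uniformly in $n$. Therefore, it suffices to show $\hat\Omega - \Omega \to_P 0$ and that $\mathbb{E}_n[\hat D_i\hat D_i'] - \bar{\mathrm{E}}[D_i D_i'] \to_P 0$. The latter has been shown in the previous step, and we only need to show the former.

In what follows, we shall repeatedly use the following elementary inequality: for arbitrary non-negative random variables $W_1,\dots,W_n$ and $q > 1$:

$$\max_{i\le n}W_i \lesssim_P n^{1/q} \ \text{ if } \ \bar{\mathrm{E}}[W_i^q] \lesssim 1, \tag{F.35}$$

which follows by Markov inequality from $\mathrm{E}[\max_{i\le n}W_i] \le n^{1/q}\,\mathrm{E}\left[\left(\frac1n\sum_{i=1}^n W_i^q\right)^{1/q}\right] \le n^{1/q}(\bar{\mathrm{E}}[W_i^q])^{1/q}$, which follows from the trivial bound $\max_{i\le n}|w_i|^q \le \sum_{i=1}^n|w_i|^q$ and Jensen's inequality.

First, we note

$$\|\mathbb{E}_n[(\hat\epsilon_i^2 - \epsilon_i^2)\hat D_i\hat D_i']\| \le \|\mathbb{E}_n[\{d_i'(\hat\alpha - \alpha_0)\}^2\hat D_i\hat D_i']\| + 2\|\mathbb{E}_n[\epsilon_i d_i'(\hat\alpha - \alpha_0)\hat D_i\hat D_i']\|$$
$$\lesssim_P \max_{i\le n}\|d_i\|_2^2\,n^{-1}\|\mathbb{E}_n[\hat D_i\hat D_i']\| + \max_{i\le n}|\epsilon_i|\|d_i\|_2\,n^{-1/2}\|\mathbb{E}_n[\hat D_i\hat D_i']\| \to_P 0,$$

since $\|\hat\alpha - \alpha_0\|_2^2 \lesssim_P 1/n$, $\|\mathbb{E}_n\hat D_i\hat D_i'\| \lesssim_P 1$ by Step 4, $\max_{i\le n}\|d_i\|_2^2\,n^{-1} \to_P 0$ (by $\max_{i\le n}\|d_i\|_2 \lesssim_P n^{1/q}$ for $q > 2$, holding by $\bar{\mathrm{E}}[\|d_i\|_2^q] \lesssim 1$ and inequality (F.35)), and $\max_{i\le n}[\|d_i\|_2|\epsilon_i|]\,n^{-1/2} \to_P 0$ (by $\max_{i\le n}[\|d_i\|_2|\epsilon_i|] \lesssim_P n^{1/q}$ for $q > 2$, holding by $\bar{\mathrm{E}}[(\|d_i\|_2|\epsilon_i|)^q] \lesssim 1$ and inequality (F.35)).

Next we note that

$$\|\mathbb{E}_n[\epsilon_i^2\hat D_i\hat D_i'] - \mathbb{E}_n[\epsilon_i^2 D_i D_i']\| = \|\mathbb{E}_n[\epsilon_i^2 D_i(\hat D_i - D_i)' + \epsilon_i^2(\hat D_i - D_i)D_i'] + \mathbb{E}_n[\epsilon_i^2(\hat D_i - D_i)(\hat D_i - D_i)']\|$$

which is bounded up to a constant by

$$\sqrt{k_e}\max_{1\le l\le k_e}\|\hat D_{il} - D_{il}\|_{2,n}\,\|\epsilon_i^2\|D_i\|_2\|_{2,n} + k_e\max_{1\le l\le k_e}\|\hat D_{il} - D_{il}\|_{2,n}^2\max_{i\le n}\epsilon_i^2 \to_P 0.$$

The latter occurs because $\|\epsilon_i^2\|D_i\|_2\|_{2,n} = \sqrt{\mathbb{E}_n[\epsilon_i^4\|D_i\|_2^2]} \lesssim_P 1$ by $\bar{\mathrm{E}}[\epsilon_i^4\|D_i\|_2^2]$ uniformly bounded in $n$ by Condition SM and by Markov inequality, and

$$\max_{1\le l\le k_e}\|\hat D_{il} - D_{il}\|_{2,n}^2\max_{i\le n}\epsilon_i^2 \lesssim_P \frac{s\log(p \vee n)}{n}\,n^{2/q_\epsilon} \to 0,$$

where the latter step holds by Step 0 and by $\max_{i\le n}\epsilon_i^2 \lesssim_P n^{2/q_\epsilon}$ holding by $\bar{\mathrm{E}}[|\epsilon_i|^{q_\epsilon}] \lesssim 1$ and inequality (F.35). Finally, $\mathbb{E}_n[\epsilon_i^2 D_i D_i'] - \bar{\mathrm{E}}[\epsilon_i^2 D_i D_i'] \to_P 0$ by the von Bahr-Esseen inequality (von Bahr and Esseen, 1965) and by $\bar{\mathrm{E}}[|\epsilon_i|^{2+\mu}\|D_i\|_2^{2+\mu}]$ bounded uniformly in $n$ for small enough $\mu > 0$ by Condition SM.

We conclude that $\mathbb{E}_n[\hat\epsilon_i^2\hat D_i\hat D_i'] - \bar{\mathrm{E}}[\epsilon_i^2 D_i D_i'] \to_P 0$. □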



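F.2. Proof of Theorem 5. Step 1. To establish claim (1), using the properties of projection we note that

$$n\mathbb{E}_n[\tilde\epsilon_i\tilde f_{ij}] = n\mathbb{E}_n[\epsilon_i\tilde f_{ij}]. \tag{F.36}$$

Since for $\hat\mu_\epsilon = (\mathbb{E}_n[w_i w_i'])^{-1}\mathbb{E}_n[w_i\epsilon_i]$ we have $\|\hat\mu_\epsilon\|_2 \le \|\mathbb{E}_n[w_i w_i']^{-1}\|\,\|\mathbb{E}_n[w_i\epsilon_i]\|_2$, where $\|\mathbb{E}_n[w_i w_i']^{-1}\|$ is bounded by SM2(ii) and $\|\mathbb{E}_n[w_i\epsilon_i]\|_2$ is of stochastic order $\sqrt{k_w/n}$ by Chebyshev inequality and SM2(ii). Hence $\|\hat\mu_\epsilon\|_2 \lesssim_P \sqrt{k_w/n}$. Since $\|w_i\|_2 \le \zeta_w$ by Condition SM2(i), we conclude that $\max_{i\le n}|w_i'\hat\mu_\epsilon| \lesssim_P \zeta_w\sqrt{k_w}/\sqrt{n} \to 0$. Hence, uniformly in $j \in \{1,\dots,p\}$,

$$\left|\sqrt{\mathbb{E}_n[\tilde\epsilon_i^2\tilde f_{ij}^2]} - \sqrt{\mathbb{E}_n[\epsilon_i^2\tilde f_{ij}^2]}\right| \le_{(a)} \sqrt{\mathbb{E}_n[(w_i'\hat\mu_\epsilon)^2\tilde f_{ij}^2]} \le_{(b)} o_P(1)\sqrt{\mathbb{E}_n[\tilde f_{ij}^2]} =_{(c)} o_P(1), \tag{F.37}$$

where (a) is by the triangular inequality and the decomposition $\tilde\epsilon_i = \epsilon_i - w_i'\hat\mu_\epsilon$, (b) is by the Hölder inequality, and (c) is by the normalization $\mathbb{E}_n[\tilde f_{ij}^2] = 1$ for each $j$. Hence, for $c > 1$, by (F.36) and (F.37), wp $\to 1$,

$$\Lambda_{\alpha_1} \le c\bar\Lambda_{\alpha_1}, \qquad \bar\Lambda_{\alpha_1} := \max_{1\le j\le p}\frac{n|\mathbb{E}_n[\epsilon_i\tilde f_{ij}]|}{\sqrt{\mathbb{E}_n[\epsilon_i^2\tilde f_{ij}^2]}}.$$

Since $\bar\Lambda_{\alpha_1}$ is a maximum of self-normalized sums of i.n.i.d. terms conditional on $X$, application of SM2(iii)-(iv) and the moderate deviation bound from Lemma 5 for the self-normalized sum with $U_{ij} = \epsilon_i\tilde f_{ij}$, conditional on $X$, implies that $P(c\bar\Lambda_{\alpha_1} \le \Lambda(1 - \gamma)) \ge 1 - \gamma - o(1)$. This verifies claim (1).

Step 2. To show claim (2) we note that using triangular and other elementary inequalities:

$$\Lambda_a = \max_{1\le j\le p}\frac{n|\mathbb{E}_n[(\tilde\epsilon_i - (a - \alpha_1)'\tilde d_{ei})\tilde f_{ij}]|}{\sqrt{\mathbb{E}_n[(\tilde\epsilon_i - (a - \alpha_1)'\tilde d_{ei})^2\tilde f_{ij}^2]}} \ge \max_{1\le j\le p}\frac{n|\mathbb{E}_n[(a - \alpha_1)'\tilde d_{ei}\tilde f_{ij}]|}{\sqrt{\mathbb{E}_n[\tilde\epsilon_i^2\tilde f_{ij}^2]} + \sqrt{\mathbb{E}_n[\{(a - \alpha_1)'\tilde d_{ei}\}^2\tilde f_{ij}^2]}} - \Lambda_{\alpha_1}.$$

The first term on the right side is bounded below by, wp $\to 1$,

$$\max_{1\le j\le p}\frac{n|\mathbb{E}_n[(a - \alpha_1)'\tilde d_{ei}\tilde f_{ij}]|}{c\sqrt{\mathbb{E}_n[\epsilon_i^2\tilde f_{ij}^2]} + \sqrt{\mathbb{E}_n[\{(a - \alpha_1)'\tilde d_{ei}\}^2\tilde f_{ij}^2]}},$$

by Step 1 for some $c > 1$, and $\Lambda_{\alpha_1} \lesssim_P \sqrt{n\log(p/\gamma)}$ also by Step 1. Hence for any constant $C$, by the last condition in the statement of the theorem, with probability converging to 1, $\Lambda_a - C\sqrt{n\log(p/\gamma)} \to +\infty$, so that claim (2) immediately follows, since $\Lambda(1 - \gamma) \lesssim \sqrt{n\log(p/\gamma)}$. □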

F.3. Proof of Theorems 6 and 7. See Online Appendix.



References 

Amemiya, T. (1966): "On the Use of Principal Components of Independent Variables in Two-Stage Least-Squares 
Estimation," International Economic Review, 7, 283-303. 

Amemiya, T. (1974): "The Non-linear Two-Stage Least Squares Estimator," Journal of Econometrics, 2, 105-110. 

Anderson, T. W., AND H. Rubin (1949): "Estimation of the Parameters of Single Equation in a Complete 
System of Stochastic Equations," Annals of Mathematical Statistics, 20, 46-63. 

Andrews, D. W. K., M. J. Moreira, AND J. H. Stock (2006): "Optimal Two-Sided Invariant Similar Tests 
for Instrumental Variables Regression," Econometrica, 74(3), 715-752. 

Andrews, D. W. K., AND J. H. Stock (2005): "Inference with Weak Instruments," Cowles Foundation Dis- 
cussion Paper No. 1530. 

Angrist, J. D., AND A. B. Krueger (1995): "Split-Sample Instrumental Variables Estimates of the Return to 
Schooling," Journal of Business & Economic Statistics, 13(2), 225-235. 

Bai, J., AND S. Ng (2008): "Forecasting Economic Time Series Using Targeted Predictors," Journal of Econo- 
metrics, 146, 304-317. 

(2009a): "Boosting Diffusion Indices," Journal of Applied Econometrics, 24. 

(2009b): "Selecting Instrumental Variables in a Data Rich Environment," Journal of Time Series Econo- 
metrics, 1(1). 

(2010): "Instrumental Variable Estimation in a Data Rich Environment," Econometric Theory, 26, 

15771606. 

Bekker, P. A. (1994): "Alternative Approximations to the Distributions of Instrumental Variables Estimators," 
Econometrica, 63, 657-681. 

Belloni, A., D. Chen, V. Chernozhukov, AND C. Hansen (2010): "Sparse Models and Methods for Optimal 
Instruments with an Application to Eminent Domain," Arxiv Working Paper. 

Belloni, A., AND V. Chernozhukov (2011a): "ℓ1-Penalized Quantile Regression for High Dimensional Sparse
Models," Annals of Statistics, 39(1), 82-130.

Belloni, A., AND V. Chernozhukov (2011b): "High-Dimensional Sparse Econometric Models, an Introduction,"
Inverse Problems and High-Dimensional Estimation, Springer Lecture Notes in Statistics, pp. 121-156.

Belloni, A., AND V. Chernozhukov (2012): "Least Squares After Model Selection in High-dimensional Sparse 
Models," forthcoming at Bernoulli, ArXiv Working Paper posted in 2009. 

Belloni, A., V. Chernozhukov, AND C. Hansen (2011a): "Estimation and Inference Methods for High- 
Dimensional Sparse Econometric Models," Advances in Economics and Econometrics, 10th World Congress of 
Econometric Society. 

(2011b): "Inference on Treatment Effects after Selection Amongst High-Dimensional Controls with an 

Application to Abortion on Crime," ArXiv working paper. 






Belloni, A., V. Chernozhukov, AND L. Wang (2011a): "Pivotal Estimation of Nonparametric Functions via 

Square-root Lasso," ArXiv Working Paper. 

(2011b): "Square-Root-LASSO: Pivotal Recovery of Sparse Signals via Conic Programming," Biometrika. 

Bickel, P. J., Y. Ritov, AND A. B. Tsybakov (2009): "Simultaneous analysis of Lasso and Dantzig selector," 

Annals of Statistics, 37(4), 1705-1732. 
Brodie, J., I. Daubechies, C. De Mol, D. Giannone, AND I. Loris (2009): "Sparse and stable Markowitz
portfolios," PNAS, 106(30), 12267-12272.

Bühlmann, P. (2006): "Boosting for high-dimensional linear models," Ann. Statist., 34(2), 559-583.

Bühlmann, P., AND S. van de Geer (2011): Statistics for High-Dimensional Data: Methods, Theory and
Applications. Springer.

Bunea, F., A. Tsybakov, AND M. H. Wegkamp (2007a): "Sparsity oracle inequalities for the Lasso," Electronic 

Journal of Statistics, 1, 169-194. 
Bunea, F., A. B. Tsybakov, AND M. H. Wegkamp (2006): "Aggregation and Sparsity via ℓ1 Penalized Least
Squares," in Proceedings of 19th Annual Conference on Learning Theory (COLT 2006) (G. Lugosi and H. U.
Simon, eds.), pp. 379-391.

(2007b): "Aggregation for Gaussian regression," The Annals of Statistics, 35(4), 1674-1697. 

Candes, E., AND T. Tao (2007): "The Dantzig selector: statistical estimation when p is much larger than n,"
Ann. Statist., 35(6), 2313-2351.

Caner, M. (2009): "LASSO-Type GMM Estimator," Econometric Theory, 25, 270-290.

Carrasco, M. (2012): "A Regularization Approach to the Many Instruments Problem," forthcoming in Journal
of Econometrics.

Carrasco, M., AND G. Tchuente Nguembu (2012): "Regularized LIML with Many Instruments," Discussion
paper, University of Montreal Working paper.
Chamberlain, G. (1987): "Asymptotic Efficiency in Estimation with Conditional Moment Restrictions," Journal 

of Econometrics, 34, 305-334. 
Chamberlain, G., AND G. Imbens (2004): "Random Effects Estimators with Many Instrumental Variables," 

Econometrica, 72, 295-306. 

Chao, J., AND N. Swanson (2005): "Consistent Estimation With a Large Number of Weak Instruments,"
Econometrica, 73, 1673-1692.
Chen, D. L., AND J. Sethi (2010): "Does Forbidding Sexual Harassment Exacerbate Gender Inequality," 

unpublished manuscript. 

Chen, D. L., AND S. Yeh (2010): "The Economic Impacts of Eminent Domain," unpublished manuscript. 
Chernozhukov, V., AND C. Hansen (2008a): "Instrumental Variable Quantile Regression: A Robust Inference 

Approach," Journal of Econometrics, 142, 379-398. 
(2008b): "The Reduced Form: A Simple Approach to Inference with Weak Instruments," Economics 

Letters, 100, 68-71. 

de la Peña, V. H., T. L. Lai, AND Q.-M. Shao (2009): Self-normalized processes, Probability and its
Applications (New York). Springer-Verlag, Berlin, Limit theory and statistical applications.

DeMiguel, V., L. Garlappi, F. Nogales, AND R. Uppal (2009): "A generalized approach to portfolio
optimization: improving performance by constraining portfolio norms," Manage. Sci., 55(5), 798-812.

Donald, S. G., AND W. K. Newey (2001): "Choosing the Number of Instruments," Econometrica, 69(5), 
1161-1191. 

Fuller, W. A. (1977): "Some Properties of a Modification of the Limited Information Estimator," Econometrica, 
45, 939-954. 

Gautier, E., AND A. B. Tsybakov (2011): "High-Dimensional Instrumental Variables Regression and Confi- 
dence Sets," ArXiv working report. 






Hahn, J. (2002): "Optimal Inference with Many Instruments," Econometric Theory, 18, 140-168. 

Hahn, J., J. A. Hausman, AND G. M. Kuersteiner (2004): "Estimation with Weak Instruments: Accuracy
of Higher-order Bias and MSE Approximations," Econometrics Journal, 7(1), 272-306.

Hansen, C., J. Hausman, AND W. K. Newey (2008): "Estimation with Many Instrumental Variables," Journal
of Business and Economic Statistics, 26, 398-422.

Hausman, J., W. Newey, T. Woutersen, J. Chao, AND N. Swanson (2009): "Instrumental Variable Esti- 
mation with Heteroskedasticity and Many Instruments," mimeo. 

Huang, J., J. L. Horowitz, AND F. Wei (2010): "Variable selection in nonparametric additive models," Ann.
Statist., 38(4), 2282-2313.

Jing, B.-Y., Q.-M. Shao, AND Q. Wang (2003): "Self-normalized Cramér-type large deviations for independent
random variables," Ann. Probab., 31(4), 2167-2215.

Kapetanios, G., L. Khalaf, AND M. Marcellino (2011): "Factor based identification-robust inference in IV
regressions," working paper.

Kapetanios, G., AND M. Marcellino (2010): "Factor-GMM estimation with large sets of possibly weak
instruments," Computational Statistics & Data Analysis, 54(11), 2655-2675.

Kleibergen, F. (2002): "Pivotal Statistics for Testing Structural Parameters in Instrumental Variables Regres- 
sion," Econometrica, 70, 1781-1803. 

(2005): "Testing Parameters in GMM Without Assuming That They Are Identified," Econometrica, 73, 

1103-1124. 

Kloek, T., AND L. Mennes (1960): "Simultaneous Equations Estimation Based on Principal Components of 

Predetermined Variables," Econometrica, 28, 45-61. 
Knight, K. (2008): "Shrinkage Estimation for Nearly Singular Designs," Econometric Theory, 24, 323-337. 
Koltchinskii, V. (2009): "Sparsity in penalized empirical risk minimization," Ann. Inst. H. Poincaré Probab.
Statist., 45(1), 7-57.

Ledoux, M., AND M. Talagrand (1991): Probability in Banach Spaces (Isoperimetry and processes). Ergebnisse
der Mathematik und ihrer Grenzgebiete, Springer-Verlag.

Lounici, K. (2008): "Sup-norm convergence rate and sign concentration property of Lasso and Dantzig
estimators," Electron. J. Statist., 2, 90-102.

Lounici, K., M. Pontil, A. B. Tsybakov, AND S. van de Geer (2010): "Taking Advantage of Sparsity in
Multi-Task Learning," arXiv:0903.1468v1 [stat.ML].

Meinshausen, N., AND B. Yu (2009): "Lasso-type recovery of sparse representations for high-dimensional data," 
Annals of Statistics, 37(1), 2246-2270. 

Moreira, M. J. (2003): "A Conditional Likelihood Ratio Test for Structural Models," Econometrica, 71, 1027- 
1048. 

Newey, W. K. (1990): "Efficient Instrumental Variables Estimation of Nonlinear Models," Econometrica, 58, 
809-837. 

(1997): "Convergence Rates and Asymptotic Normality for Series Estimators," Journal of Econometrics, 

79, 147-168. 

Okui, R. (2010): "Instrumental Variable Estimation in the Presence of Many Moment Conditions," forthcoming 
Journal of Econometrics. 

Rosenbaum, M., AND A. B. Tsybakov (2008): "Sparse recovery under matrix uncertainty," arXiv:0812.2818v1
[math.ST].

Rosenthal, H. P. (1970): "On the subspaces of $L^p$ ($p > 2$) spanned by sequences of independent random
variables," Israel J. Math., 9, 273-303.

Rudelson, M., AND R. Vershynin (2008): "On sparse reconstruction from Fourier and Gaussian
measurements," Communications on Pure and Applied Mathematics, 61, 1025-1045.






Rudelson, M., AND S. Zhou (2011): "Reconstruction from anisotropic random measurements,"
arXiv:1106.1151.

Staiger, D., AND J. H. Stock (1997): "Instrumental Variables Regression with Weak Instruments," Econo- 
metrica, 65, 557-586. 

Stock, J. H., J. H. Wright, AND M. Yogo (2002): "A Survey of Weak Instruments and Weak Identification 

in Generalized Method of Moments," Journal of Business and Economic Statistics, 20(4), 518-529. 
Stock, J. H., AND M. Yogo (2005): "Testing for Weak Instruments in Linear IV Regression," in Identification
and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, ed. by D. W. K. Andrews, and
J. H. Stock, chap. 5, pp. 80-108. Cambridge: Cambridge University Press.

Tibshirani, R. (1996): "Regression shrinkage and selection via the Lasso," J. Roy. Statist. Soc. Ser. B, 58,
267-288.

van de Geer, S. A. (2008): "High-dimensional generalized linear models and the lasso," Annals of Statistics,
36(2), 614-645.

van der Vaart, A. W., AND J. A. Wellner (1996): Weak Convergence and Empirical Processes. Springer
Series in Statistics.

von Bahr, B., AND C.-G. Esseen (1965): "Inequalities for the rth absolute moment of a sum of random
variables, $1 \le r \le 2$," Ann. Math. Statist., 36, 299-303.

Wainwright, M. (2009): "Sharp thresholds for noisy and high-dimensional recovery of sparsity using
ℓ1-constrained quadratic programming (Lasso)," IEEE Transactions on Information Theory, 55, 2183-2202.

Zhang, C.-H., AND J. Huang (2008): "The sparsity and bias of the lasso selection in high-dimensional linear
regression," Ann. Statist., 36(4), 1567-1594.






Table 1. Simulation Results.

                                 Exponential                         S = 5                             S = 50
Estimator                 N(0) Med. Bias    MAD  rp(.05)    N(0) Med. Bias    MAD  rp(.05)    N(0) Med. Bias    MAD  rp(.05)

A. Concentration Parameter = 30, n = 100
2SLS(100)                          0.524  0.524    1.000             0.520  0.520    1.000             0.528  0.528    0.998
FULL(100)                          0.373  0.741    0.646             0.476  0.781    0.690             0.285  0.832    0.580
Post-LASSO                 483     0.117  0.183    0.012     485     0.128  0.178    0.008     498     0.363  0.368    0.012
Post-LASSO-F               483     0.117  0.184    0.012     485     0.128  0.178    0.008     498     0.363  0.368    0.012
Post-LASSO (Ridge)         500     0.229  0.263    0.000     500     0.212  0.239    0.000     500     0.362  0.364    0.002
Post-LASSO-F (Ridge)       500     0.229  0.263    0.000     500     0.212  0.239    0.000     500     0.362  0.364    0.002
sup-Score                                          0.006                             0.000                             0.008

B. Concentration Parameter = 30, n = 250
2SLS(100)                          0.493  0.493    1.000             0.485  0.485    1.000             0.486  0.486    1.000
FULL(100)                          0.028  0.286    0.076             0.023  0.272    0.056             0.046  0.252    0.072
Post-LASSO                 396     0.106  0.163    0.044     423     0.105  0.165    0.042     499     0.358  0.359    0.008
Post-LASSO-F               396     0.107  0.164    0.048     423     0.105  0.166    0.044     499     0.358  0.359    0.008
Post-LASSO (Ridge)         500     0.191  0.223    0.004     500     0.196  0.217    0.006     500     0.353  0.355    0.000
Post-LASSO-F (Ridge)       500     0.191  0.223    0.004     500     0.196  0.217    0.006     500     0.353  0.355    0.000
sup-Score                                          0.002                             0.010                             0.006

C. Concentration Parameter = 180, n = 100
2SLS(100)                          0.353  0.353    0.952             0.354  0.354    0.958             0.350  0.350    0.948
FULL(100)                          0.063  0.563    0.648             0.096  0.562    0.694             0.148  0.538    0.656
Post-LASSO                 120     0.037  0.093    0.078     132     0.035  0.100    0.052     498     0.192  0.211    0.000
Post-LASSO-F               120     0.030  0.093    0.070     132     0.025  0.100    0.046     498     0.192  0.211    0.000
Post-LASSO (Ridge)         500     0.061  0.132    0.002     500     0.063  0.116    0.000     500     0.004  0.119    0.000
Post-LASSO-F (Ridge)       500     0.061  0.132    0.002     500     0.063  0.116    0.000     500     0.004  0.119    0.000
sup-Score                                          0.002                             0.002                             0.000

D. Concentration Parameter = 180, n = 250
2SLS(100)                          0.289  0.289    0.966             0.281  0.281    0.972             0.280  0.280    0.964
FULL(100)                          0.008  0.082    0.058             0.007  0.081    0.044             0.008  0.083    0.048
Post-LASSO                         0.032  0.073    0.054             0.019  0.067    0.060     411     0.233  0.237    0.044
Post-LASSO-F                       0.024  0.069    0.038             0.014  0.068    0.046     411     0.235  0.236    0.040
Post-LASSO (Ridge)         211     0.062  0.095    0.098     225     0.058  0.084    0.082     295    -0.008  0.090    0.030
Post-LASSO-F (Ridge)       211     0.061  0.096    0.082     225     0.056  0.081    0.062     295    -0.004  0.090    0.032
sup-Score                                          0.012                             0.012                             0.012

Note: Results are based on 500 simulation replications and 100 instruments. Column labels indicate the structure of the first-stage 
coefficients as described in the text. 2SLS(100) and FULL(100) are respectively the 2SLS and Fuller(1) estimators using all 100 potential
instruments. Post-LASSO and Post-LASSO-F respectively correspond to 2SLS and Fuller(1) using the instruments selected from LASSO
variable selection among the 100 instruments with inference based on the asymptotic normal approximation; in cases where no instruments 
are selected, the procedure switches to using the sup-Score test for inference. sup-Score provides the rejection frequency for a weak 
identification robust procedure that is suited to situations with more instruments than observations. Post-LASSO (Ridge) and Post-LASSO-F 
(Ridge) are defined as Post-LASSO and Post-LASSO-F but augment the instrument set with a fitted value obtained via ridge regression as 
described in the text. We report the number of replications in which LASSO selected no instruments (N(0)), median bias (Med. Bias), median 
absolute deviation (MAD), and rejection frequency for 5% level tests (rp(.05)). In cases where LASSO selects no instruments, Med. Bias, and 
MAD are based on 2SLS using the single instrument with the largest sample correlation to the endogenous variable and rp(.05) is based on 
the sup-Score test. 
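
The summary statistics reported in Table 1 can be computed from replication output roughly as follows. This is a minimal sketch, not the authors' code: the function name `summarize`, its arguments, and the normal critical value 1.96 for the 5% level test are illustrative assumptions. MAD is taken around the true parameter value, which is an assumption that matches the 2SLS(100) rows, where median bias and MAD coincide.

```python
import statistics

def summarize(estimates, std_errors, true_value, crit=1.96):
    """Return (median bias, MAD around the true value, rejection frequency).

    estimates, std_errors: one entry per simulation replication (hypothetical inputs).
    crit: two-sided normal critical value for a 5% level test (assumed).
    """
    errors = [b - true_value for b in estimates]
    median_bias = statistics.median(errors)
    mad = statistics.median(abs(e) for e in errors)
    # A replication rejects when the t-statistic for the true value exceeds crit.
    rejections = sum(abs(e) / s > crit for e, s in zip(errors, std_errors))
    return median_bias, mad, rejections / len(estimates)
```

For the Post-LASSO rows, replications in which no instrument is selected would need the fallback described in the note above (single best instrument for bias and MAD, sup-Score test for rejection); that branch is omitted here.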
[Figure 1 panels: Exponential, n = 250; Cut-Off, s = 5, n = 250; Cut-Off, s = 50, n = 250]

Figure 1. Size-adjusted power curves for Post-Lasso-F (dot-dash), Post-Lasso-F
(Ridge) (dotted), FULL(100) (dashed), and sup-Score (solid) from the simulation
example with concentration parameter of 180 for n = 100 and n = 250.
Table 2: Effect of Federal Appellate Takings Law Decisions on Economic Outcomes

                                     Home Prices                                       GDP
                      log(FHFA)     log(Non-Metro)  log(Case-Shiller)           log(GDP)
Sample Size                 312                110                183                312
OLS                      0.0114             0.0108             0.0152             0.0099
  s.e.                   0.0132             0.0066             0.0132             0.0048
2SLS                     0.0262             0.0480             0.0604             0.0165
  s.e.                   0.0441             0.0212             0.0296             0.0162
  FS-W                  28.0859            82.9647            67.7452            28.0859
Post-LASSO          [illegible]        [illegible]        [illegible]        [illegible]
  s.e.                   0.0465             0.0132             0.0249             0.0161
  FS-W                  44.5337           243.1946            89.5950            44.5337
  S                           1                  4                  2                  1
Post-LASSO+              0.0314             0.0348             0.0628             0.0144
  s.e.                   0.0366             0.0127             0.0245             0.0131
  FS-W                  73.3010           260.9823           105.3206            73.3010
  S                           3                  6                  3                  3
Spec. Test              -0.2064             0.5753            -0.0985             0.1754


Note: This table reports the estimated effect of an additional pro-plaintiff takings decision, a decision 
that goes against the government and leaves the property in the hands of the private owner, on 
various economic outcomes using two-stage least squares (2SLS). The characteristics of randomly 
assigned judges serving on the panel that decides the case are used as instruments for the decision 
variable. All estimates include circuit effects, circuit-specific time trends, time effects, controls for the 
number of cases in each circuit-year, and controls for the demographics of judges available within each 
circuit-year. Each column corresponds to a different dependent variable. log(FHFA), log(Non-Metro), 
and log(Case-Shiller) are within-circuit averages of log-house-price-indexes, and log(GDP) is the within- 
circuit average of log of state-level GDP. OLS are ordinary least squares estimates. 2SLS is the 2SLS 
estimator with the original instruments in Chen and Yeh (2010). Post-LASSO provides 2SLS estimates 
obtained using instruments selected by LASSO with the refined data-dependent penalty choice. Post- 
LASSO+ uses the union of the instruments selected by Lasso and the instruments of Chen and Yeh 
(2010). Rows labeled s.e. provide the estimated standard errors of the associated estimator. All 
standard errors are computed with clustering at the circuit-year level. FS-W is the value of the first- 
stage Wald statistic using the selected instrument. S is the number of instruments used in obtaining 
the 2SLS estimates. Spec. Test is the value of a Hausman test statistic comparing the 2SLS estimate
of the effect of takings law decisions using the Chen and Yeh (2010) instruments to the estimated
effect using the LASSO-selected instruments.
Appendix A. Tools

A.1. Lyapunov CLT, Rosenthal Inequality, and Von Bahr-Esseen Inequality.

Lemma 1 (Lyapunov CLT). Let $\{X_{i,n}, i = 1, \ldots, n\}$ be independent zero-mean random variables
with variance $s_{i,n}^2$, $n = 1, 2, \ldots$. Define $s_n^2 = \sum_{i=1}^n s_{i,n}^2$. If for some $\mu > 0$, Lyapunov's
condition holds:

$$\lim_{n \to \infty} \frac{1}{s_n^{2+\mu}} \sum_{i=1}^n E\left[|X_{i,n}|^{2+\mu}\right] = 0,$$

then as $n$ goes to infinity:

$$\frac{1}{s_n} \sum_{i=1}^n X_{i,n} \to_d N(0, 1).$$



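Lemma 2 (Rosenthal Inequality). Let $X_1, \ldots, X_n$ be independent zero-mean random variables,
then for $r \ge 2$

$$E\left[\left|\sum_{i=1}^n X_i\right|^r\right] \le C(r) \max\left\{ \sum_{i=1}^n E\left(|X_i|^r\right), \left(\sum_{i=1}^n E\left(X_i^2\right)\right)^{r/2} \right\}.$$

This is due to Rosenthal (1970).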

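Corollary 1. Let $r \ge 2$, and consider the case of independent zero-mean variables $X_i$ with
$E E_n(X_i^2) = 1$ and $E E_n(|X_i|^r)$ bounded by $C$. Then for any $\ell_n \to \infty$

$$\Pr\left( \frac{\left|\sum_{i=1}^n X_i\right|}{n} > \ell_n n^{-1/2} \right) \le \frac{2 C(r) C}{\ell_n^r} \to 0.$$

To verify the corollary, we use Rosenthal's inequality $E\left(\left|\sum_{i=1}^n X_i\right|^r\right) \le C n^{r/2}$, and the result
follows by Markov inequality,

$$P\left( \frac{\left|\sum_{i=1}^n X_i\right|}{n} > c \right) \le \frac{C n^{r/2}}{c^r n^r} = \frac{C}{c^r n^{r/2}}.$$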

Lemma 3 (von Bahr-Esseen Inequality). Let $X_1, \ldots, X_n$ be independent zero-mean random variables.
Then for $1 \le r \le 2$

$$E\left[\left|\sum_{i=1}^n X_i\right|^r\right] \le (2 - n^{-1}) \cdot \sum_{k=1}^n E\left(|X_k|^r\right).$$

This result is due to von Bahr and Esseen (1965).
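
Corollary 2. Let $r \in [1, 2]$, and consider the case of independent zero-mean variables $X_i$ with
$E E_n(|X_i|^r)$ bounded by $C$. Then for any $\ell_n \to \infty$

$$\Pr\left( \frac{\left|\sum_{i=1}^n X_i\right|}{n} > \ell_n n^{-(1 - 1/r)} \right) \le \frac{2C}{\ell_n^r} \to 0.$$

The corollary follows by Markov and von Bahr-Esseen's inequalities,

$$P\left( \frac{\left|\sum_{i=1}^n X_i\right|}{n} > c \right) \le \frac{E\left[\left|\sum_{i=1}^n X_i\right|^r\right]}{c^r n^r} \le \frac{(2 - n^{-1}) \sum_{i=1}^n E\left(|X_i|^r\right)}{c^r n^r} \le \frac{2C}{c^r n^{r-1}}.$$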
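
As a sanity check on Lemma 3, both sides of the von Bahr-Esseen inequality can be computed exactly for a small i.i.d. discrete example. This is only an illustrative sketch: the function name `von_bahr_esseen_sides` and the two-point distribution below are assumptions chosen for the example, not objects from the paper.

```python
from itertools import product

def von_bahr_esseen_sides(support, probs, n, r):
    """Exact E|X_1 + ... + X_n|^r and (2 - 1/n) * sum_k E|X_k|^r for i.i.d. discrete X_k."""
    mean = sum(x * p for x, p in zip(support, probs))
    assert abs(mean) < 1e-12, "the inequality is stated for zero-mean variables"
    lhs = 0.0
    # Enumerate every outcome of (X_1, ..., X_n) with its probability.
    for combo in product(range(len(support)), repeat=n):
        weight = 1.0
        total = 0.0
        for idx in combo:
            weight *= probs[idx]
            total += support[idx]
        lhs += weight * abs(total) ** r
    rhs = (2 - 1 / n) * n * sum(abs(x) ** r * p for x, p in zip(support, probs))
    return lhs, rhs

# Skewed two-point distribution with mean zero: X = -1 w.p. 0.8, X = 4 w.p. 0.2.
lhs, rhs = von_bahr_esseen_sides([-1.0, 4.0], [0.8, 0.2], n=6, r=1.5)
print(lhs <= rhs)
```

Exact enumeration is used instead of Monte Carlo so that the comparison does not depend on simulation noise; it is only feasible for small n.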

A.2. A Symmetrization-based Probability Inequality. Next we proceed to use symmetrization
arguments to bound the empirical process. Let $\|f\|_{P_n,2} = \sqrt{E_n[f(Z_i)^2]}$, $G_n(f) = \sqrt{n} E_n[f(Z_i) - E[f(Z_i)]]$,
and for a random variable $Z$ let $q(Z, 1 - \tau)$ denote its $(1 - \tau)$-quantile. The proof follows
standard symmetrization arguments.

Lemma 4 (Maximal inequality via symmetrization). Let $Z_1, \ldots, Z_n$ be arbitrary independent
stochastic processes and $\mathcal{F}$ a finite set of measurable functions. For any $\tau \in (0, 1/2)$, and
$\delta \in (0, 1)$ with probability at least $1 - 4\tau - 4\delta$ we have

$$\sup_{f \in \mathcal{F}} |G_n(f)| \le 4\sqrt{2 \log(2|\mathcal{F}|/\delta)} \, q\left(\sup_{f \in \mathcal{F}} \|f\|_{P_n,2}, 1 - \tau\right) \vee 2 \sup_{f \in \mathcal{F}} q(|G_n(f)|, 1/2).$$

Proof. Let

$$e_{1n} = \sqrt{2 \log(2|\mathcal{F}|/\delta)} \, q\left(\max_{f \in \mathcal{F}} \sqrt{E_n[f(Z_i)^2]}, 1 - \tau\right), \quad e_{2n} = \max_{f \in \mathcal{F}} q\left(|G_n(f(Z_i))|, 1/2\right),$$

and the event $\mathcal{E} = \left\{ \max_{f \in \mathcal{F}} \sqrt{E_n[f^2(Z_i)]} \le q\left(\max_{f \in \mathcal{F}} \sqrt{E_n[f^2(Z_i)]}, 1 - \tau\right) \right\}$ which satisfies
$P(\mathcal{E}) \ge 1 - \tau$. By the symmetrization Lemma 2.3.7 of van der Vaart and Wellner (1996) (by
definition of $e_{2n}$ we have $\beta_n(x) \ge 1/2$ in Lemma 2.3.7) we obtain

$$P\left\{ \max_{f \in \mathcal{F}} |G_n(f(Z_i))| > 4e_{1n} \vee 2e_{2n} \right\} \le 4 P\left\{ \max_{f \in \mathcal{F}} |G_n(\varepsilon_i f(Z_i))| > e_{1n} \right\}$$
$$\le 4 P\left\{ \max_{f \in \mathcal{F}} |G_n(\varepsilon_i f(Z_i))| > e_{1n} \mid \mathcal{E} \right\} + 4\tau$$

where $\varepsilon_i$ are independent Rademacher random variables, $P(\varepsilon_i = 1) = P(\varepsilon_i = -1) = 1/2$.
Thus a union bound yields

$$P\left\{ \max_{f \in \mathcal{F}} |G_n(f(Z_i))| > 4e_{1n} \vee 2e_{2n} \right\} \le 4\tau + 4|\mathcal{F}| \max_{f \in \mathcal{F}} P\left\{ |G_n(\varepsilon_i f(Z_i))| > e_{1n} \mid \mathcal{E} \right\}. \quad (A.38)$$

We then condition on the values of $Z_1, \ldots, Z_n$ and $\mathcal{E}$, denoting the conditional probability
measure as $P_\varepsilon$. Conditional on $Z_1, \ldots, Z_n$, by the Hoeffding inequality the symmetrized process
$G_n(\varepsilon_i f(Z_i))$ is sub-Gaussian for the $L^2(P_n)$ norm, namely, for $f \in \mathcal{F}$, $P_\varepsilon\{|G_n(\varepsilon_i f(Z_i))| > x\} \le 2\exp(-x^2/\{2E_n[f^2(Z_i)]\})$.
Hence, under the event $\mathcal{E}$, we can bound

$$P_\varepsilon\left\{ |G_n(\varepsilon_i f(Z_i))| > e_{1n} \mid Z_1, \ldots, Z_n, \mathcal{E} \right\} \le 2\exp\left(-e_{1n}^2/[2E_n[f^2(Z_i)]]\right) \le 2\exp(-\log(2|\mathcal{F}|/\delta)).$$

Taking the expectation over $Z_1, \ldots, Z_n$ does not affect the right hand side bound. Plugging in
this bound yields the result. □
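
The Hoeffding step in the proof can be checked exactly for a small sample by enumerating all sign patterns. The sketch below is illustrative only: `rademacher_tail`, `hoeffding_bound`, and the list `values` (standing in for fixed realizations of $f(Z_i)$) are hypothetical names, not part of the paper.

```python
from itertools import product
from math import exp, sqrt

def rademacher_tail(values, x):
    """Exact P(|n^{-1/2} * sum_i eps_i * values[i]| > x) over all 2^n sign patterns."""
    n = len(values)
    hits = 0
    for signs in product((-1, 1), repeat=n):
        g = sum(s * v for s, v in zip(signs, values)) / sqrt(n)
        hits += abs(g) > x
    return hits / 2 ** n

def hoeffding_bound(values, x):
    """Sub-Gaussian bound 2 * exp(-x^2 / (2 * E_n[f^2])) used in the proof."""
    second_moment = sum(v * v for v in values) / len(values)
    return 2 * exp(-x * x / (2 * second_moment))

values = [0.5, -1.2, 2.0, 0.3, 1.1, -0.7, 0.9, 1.5]
for x in (0.5, 1.0, 2.0, 3.0):
    print(x, rademacher_tail(values, x), hoeffding_bound(values, x))
```

With $x = e_{1n}$ and $E_n[f^2(Z_i)]$ bounded by the squared quantile on the event $\mathcal{E}$, the bound equals $\delta/|\mathcal{F}|$, so the union-bound term $4|\mathcal{F}| \cdot \delta/|\mathcal{F}|$ in (A.38) is $4\delta$.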

Appendix B. Proof of Theorem 6.

Proof of Theorem 6. To show part 1, note that by a standard argument

$$\sqrt{n}(\tilde\alpha - \alpha) = M^{-1} G_n[A_i \epsilon_i] + o_P(1).$$

From the proof of Theorem 4 we have that

$$\sqrt{n}(\hat\alpha - \alpha_a) = Q^{-1} G_n[D(x_i) \epsilon_i] + o_P(1).$$

The conclusion follows. The consistency of $\hat\Sigma$ for $\Sigma$ can be demonstrated similarly to the proof
of consistency of $\hat\Omega$ and $\hat Q$ in the proof of Theorem 4.

To show part 2, let $\alpha$ denote the true value as before, which by assumption coincides with the
estimand of the baseline IV estimator by the standard argument, $\tilde\alpha - \alpha = o_P(1)$. The baseline
IV estimator is consistent for this quantity. Under the alternative hypothesis, the estimand of
$\hat\alpha$ is

$$\alpha_a = E[D(x_i) d_i']^{-1} E[D(x_i) y_i] = \alpha + E[D(x_i) D(x_i)']^{-1} E[D(x_i) \epsilon_i].$$

Under the alternative hypothesis, $\|E[D(x_i)\epsilon_i]\|_2$ is bounded away from zero uniformly in $n$.
Hence, since the eigenvalues of $Q$ are bounded away from zero uniformly in $n$, $\|\alpha - \alpha_a\|_2$ is also
bounded away from zero uniformly in $n$. Thus, it remains to show that $\hat\alpha$ is consistent for $\alpha_a$.
We have that

$$\hat\alpha - \alpha_a = E_n[\hat D(x_i) d_i']^{-1} E_n[\hat D(x_i) \epsilon_i] - E[D(x_i) D(x_i)']^{-1} E[D(x_i) \epsilon_i]$$

so that

$$\|\hat\alpha - \alpha_a\|_2 \le \|E_n[\hat D(x_i) d_i']^{-1} - E[D(x_i) D(x_i)']^{-1}\| \, \|E_n[\hat D(x_i) \epsilon_i]\|_2 \quad (B.39)$$
$$+ \|E[D(x_i) D(x_i)']^{-1}\| \, \|E_n[\hat D(x_i) \epsilon_i] - E[D(x_i) \epsilon_i]\|_2 = o_P(1), \quad (B.40)$$

provided that (i) $\|E_n[\hat D(x_i) d_i']^{-1} - E[D(x_i) D(x_i)']^{-1}\| = o_P(1)$, which is shown in the proof
of Theorem 4; (ii) $\|E[D(x_i) D(x_i)']^{-1}\| = \|Q^{-1}\|$ is bounded from above uniformly in $n$, which
follows from the assumption on $Q$ in Theorem 4; and (iii)

$$\|E_n[\hat D(x_i) \epsilon_i] - E[D(x_i) \epsilon_i]\|_2 = o_P(1), \quad \|E[D(x_i) \epsilon_i]\|_2 = O(1),$$

where $\|E[D(x_i)\epsilon_i]\|_2 = O(1)$ is assumed. To show the last claim note that

$$\|E_n[\hat D(x_i) \epsilon_i] - E[D(x_i) \epsilon_i]\|_2 \le \|E_n[\hat D(x_i) \epsilon_i] - E_n[D(x_i) \epsilon_i]\|_2 + \|E_n[D(x_i) \epsilon_i] - E[D(x_i) \epsilon_i]\|_2$$
$$\le \sqrt{k_e} \max_{1 \le l \le k_e} \|\hat D_l(x_i) - D_l(x_i)\|_{2,n} \|\epsilon_i\|_{2,n} + o_P(1) = o_P(1),$$

since $k_e$ is fixed, $\|E_n[D(x_i)\epsilon_i] - E[D(x_i)\epsilon_i]\|_2 = o_P(1)$ by the von Bahr-Esseen inequality (von Bahr
and Esseen (1965)) and SM, $\|\epsilon_i\|_{2,n} = O_P(1)$ follows by the Markov inequality and assumptions
on the moments of $\epsilon_i$, and $\max_{1 \le l \le k_e} \|\hat D_l(x_i) - D_l(x_i)\|_{2,n} = o_P(1)$ follows from Theorems 1 and
2. □

Appendix C. Proof of Theorem 7

We introduce additional superscripts $a$ and $b$ on all variables; and $n$ gets replaced by either
$n_a$ or $n_b$. The proof is divided in steps. Assuming

$$\max_{1 \le l \le k_e} \|\hat D_{il}^k - D_{il}^k\|_{2,n_k} \lesssim_P \sqrt{\frac{s \log(p \vee n)}{n}} = o_P(1), \quad k = a, b, \quad (C.41)$$

Step 1 establishes bounds for the intermediary estimates on each subsample. Step 2 establishes
the result for the final estimator $\hat\alpha_{ab}$ and consistency of the matrices estimates. Finally, Step 3
establishes (C.41).



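Step 1. We have that

$$\sqrt{n_k}(\hat\alpha_k - \alpha_0) = \{E_{n_k}[\hat D_i^k d_i^{k\prime}]\}^{-1} \left( G_{n_k}[D_i^k \epsilon_i^k] + o_P(1) \right)$$
$$= \{E_{n_k}[D_i^k D_i^{k\prime}] + o_P(1)\}^{-1} \left( G_{n_k}[D_i^k \epsilon_i^k] + o_P(1) \right)$$

since

$$E_{n_k}[\hat D_i^k d_i^{k\prime}] = E_{n_k}[D_i^k D_i^{k\prime}] + o_P(1) \quad (C.42)$$
$$\sqrt{n_k} E_{n_k}[\hat D_i^k \epsilon_i^k] = G_{n_k}[D_i^k \epsilon_i^k] + o_P(1). \quad (C.43)$$

Indeed, (C.42) follows similarly to Step 2 in the proof of Theorems 4 and 5 and condition (C.41).
The relation (C.43) follows from $E[\epsilon_i^k | x_i^k] = 0$ for both $k = a$ and $k = b$, Chebyshev inequality
and

$$E\left[ \left\| \sqrt{n_k} E_{n_k}[(\hat D_i^k - D_i^k) \epsilon_i^k] \right\|_2^2 \,\Big|\, x_i^k, i = 1, \ldots, n, k^c \right] \lesssim \max_{1 \le l \le k_e} \|\hat D_{il}^k - D_{il}^k\|_{2,n_k}^2 \to_P 0,$$

where $E[\cdot | x_i^k, i = 1, \ldots, n, k^c]$ denotes the expectation computed conditional on $x_i^k, i = 1, \ldots, n$ and on
the sample $k^c$, where $k^c = \{a, b\} \setminus k$. The bound follows from the fact that (a)

$$\hat D_{il}^k - D_{il}^k = f_i^{k\prime}(\hat\beta_l^{k^c} - \beta_{0l}) - a_l(x_i^k), \quad 1 \le i \le n_k,$$

by Condition AS, where $(\hat\beta_l^{k^c} - \beta_{0l})$ are independent of $\{\epsilon_i^k, 1 \le i \le n_k\}$, by the independence of
the two subsamples $k$ and $k^c$, (b) $\{\epsilon_i^k, x_i^k, 1 \le i \le n_k\}$ are independent across $i$ and independent
from the sample $k^c$, (c) $\{\epsilon_i^k, 1 \le i \le n_k\}$ have conditional mean equal to zero, conditional on
$x_i^k, i = 1, \ldots, n$ and have conditional variance bounded from above, uniformly in $n$, conditional
on $x_i^k, i = 1, \ldots, n$, by Condition SM, and (d) that $\max_{1 \le l \le k_e} \|\hat D_{il}^k - D_{il}^k\|_{2,n_k} \to_P 0$.

Using the same arguments as in Step 1 in the proof of Theorems 4 and 5,

$$\sqrt{n_k}(\hat\alpha_k - \alpha_0) = (E_{n_k}[D_i^k D_i^{k\prime}])^{-1} G_{n_k}[D_i^k \epsilon_i^k] + o_P(1) = O_P(1).$$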
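
Step 2. Now putting together terms we get

$$\sqrt{n}(\hat\alpha_{ab} - \alpha_0) = \left( (n_a/n) E_{n_a}[\hat D_i^a \hat D_i^{a\prime}] + (n_b/n) E_{n_b}[\hat D_i^b \hat D_i^{b\prime}] \right)^{-1} \times$$
$$\times \left( (n_a/n) E_{n_a}[\hat D_i^a \hat D_i^{a\prime}] \sqrt{n}(\hat\alpha_a - \alpha_0) + (n_b/n) E_{n_b}[\hat D_i^b \hat D_i^{b\prime}] \sqrt{n}(\hat\alpha_b - \alpha_0) \right)$$
$$= \left( (n_a/n) E_{n_a}[D_i^a D_i^{a\prime}] + (n_b/n) E_{n_b}[D_i^b D_i^{b\prime}] \right)^{-1} \times$$
$$\times \left( (n_a/n) E_{n_a}[D_i^a D_i^{a\prime}] \sqrt{n}(\hat\alpha_a - \alpha_0) + (n_b/n) E_{n_b}[D_i^b D_i^{b\prime}] \sqrt{n}(\hat\alpha_b - \alpha_0) \right) + o_P(1)$$
$$= \{E_n[D_i D_i']\}^{-1} \times \left\{ (1/\sqrt{2}) \times G_{n_a}[D_i^a \epsilon_i^a] + (1/\sqrt{2}) G_{n_b}[D_i^b \epsilon_i^b] \right\} + o_P(1)$$
$$= \{E_n[D_i D_i']\}^{-1} \times G_n[D_i \epsilon_i] + o_P(1)$$

where we are also using the fact that

$$E_{n_k}[\hat D_i^k \hat D_i^{k\prime}] - E_{n_k}[D_i^k D_i^{k\prime}] = o_P(1),$$

which is shown similarly to the proofs given in Theorem 4 for showing that $E_n[\hat D_i \hat D_i'] - E_n[D_i D_i'] = o_P(1)$.
The conclusion of the theorem follows by an application of the Lyapunov CLT,
similarly to the proof of Theorem 4 of the main text.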
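
The first display of Step 2 says that the final estimator is a matrix-weighted average of the two subsample estimators, with weights $(n_k/n) E_{n_k}[\hat D_i^k \hat D_i^{k\prime}]$. A minimal sketch for the scalar case $k_e = 1$ follows; the function name `combine_split_sample` and its inputs are hypothetical, and the subsample estimates are assumed to have been computed already.

```python
def combine_split_sample(alpha_a, alpha_b, dhat_a, dhat_b):
    """Combine subsample IV estimates for a scalar endogenous regressor (k_e = 1).

    alpha_a, alpha_b: IV estimates from subsamples a and b.
    dhat_a, dhat_b: fitted optimal-instrument values in each subsample.
    """
    n_a, n_b = len(dhat_a), len(dhat_b)
    n = n_a + n_b
    q_a = sum(d * d for d in dhat_a) / n_a  # E_{n_a}[Dhat^2]
    q_b = sum(d * d for d in dhat_b) / n_b  # E_{n_b}[Dhat^2]
    w_a, w_b = n_a / n, n_b / n
    return (w_a * q_a * alpha_a + w_b * q_b * alpha_b) / (w_a * q_a + w_b * q_b)

# Equal instrument strength in both halves gives the simple average.
print(combine_split_sample(1.0, 3.0, [1.0, 1.0], [1.0, 1.0]))
```

A subsample whose fitted instruments have a larger second moment receives more weight, which is what the normalization in the display implies.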



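Step 3. In this step we establish (C.41). For every observation $i$ and $l = 1, \ldots, k_e$, by
Condition AS we have

$$D_{il} = f_i'\beta_{l0} + a_l(x_i), \quad \|\beta_{l0}\|_0 \le s, \quad \max_{1 \le l \le k_e} \|a_l(x_i)\|_{2,n} \le c_s \lesssim_P \sqrt{s/n}. \quad (C.44)$$

Under our conditions, the sparsity bound for LASSO by Lemma 9 implies that for all $\hat\delta_l^k = \hat\beta_l^k - \beta_{l0}$,
$k = a, b$, and $l = 1, \ldots, k_e$

$$\|\hat\delta_l^k\|_0 \lesssim_P s.$$

Therefore, by condition SE, we have for $M = E_{n_k}[f_i^k f_i^{k\prime}]$, $k = a, b$, that with probability going
to 1, for $n$ large enough

$$0 < \kappa' \le \phi_{\min}(\|\hat\delta_l^k\|_0)(M) \le \phi_{\max}(\|\hat\delta_l^k\|_0)(M) \le \kappa'' < \infty,$$

where $\kappa'$ and $\kappa''$ are some constants that do not depend on $n$. Thus,

$$\|\hat D_{il}^k - D_{il}^k\|_{2,n_k} = \|f_i^{k\prime}\beta_{l0} + a_l(x_i^k) - f_i^{k\prime}\hat\beta_l^{k^c}\|_{2,n_k}$$
$$\le \|f_i^{k\prime}(\beta_{l0} - \hat\beta_l^{k^c})\|_{2,n_k} + \|a_l(x_i^k)\|_{2,n_k}$$
$$\le \sqrt{\kappa''/\kappa'} \, \|f_i^{k^c\prime}(\hat\beta_l^{k^c} - \beta_{l0})\|_{2,n_{k^c}} + \|a_l(x_i^k)\|_{2,n_k} \quad (C.45)$$

where the last inequality holds with probability going to 1 by Condition SE imposed on matrices
$M = E_{n_k}[f_i^k f_i^{k\prime}]$, $k = a, b$, and by

$$\|a_l(x_i^k)\|_{2,n_k} \le \sqrt{n/n_k} \, \|a_l(x_i)\|_{2,n} \le \sqrt{n/n_k} \, c_s.$$

Then, in view of (C.44), $\sqrt{n/n_k} = \sqrt{2} + o(1)$, and condition $s \log p = o(n)$, the result (C.41)
holds by (C.45) combined with Theorem 1 for LASSO and Theorem 2 for Post-LASSO which
imply

$$\max_{1 \le l \le k_e} \|f_i^{k\prime}(\hat\beta_l^k - \beta_{l0})\|_{2,n_k} \lesssim_P \sqrt{\frac{s \log(p \vee n)}{n}} = o_P(1), \quad k = a, b.$$

□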

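Appendix D. Proof of Lemma 3.

Part 1. The first condition in RF(iv) is assumed, and we omit the proof for the third condition
since it is analogous to the proof for the second condition.

Note that $\max_{1 \le j \le p} E_n[f_{ij}^4 v_{il}^4] \le (E_n[v_{il}^8])^{1/2} \max_{1 \le j \le p} (E_n[f_{ij}^8])^{1/2} \lesssim_P 1$ since $\max_{1 \le j \le p} E_n[f_{ij}^8] \lesssim_P 1$
by assumption and $\max_{1 \le l \le k_e} E_n[v_{il}^8] \lesssim_P 1$ by the bounded $k_e$, Markov inequality, and the
assumption that $E[v_{il}^8]$ are uniformly bounded in $n$ and $l$.

Thus, applying Lemma 4,

$$\max_{1 \le l \le k_e, 1 \le j \le p} \left| E_n[f_{ij}^2 v_{il}^2] - E[f_{ij}^2 v_{il}^2] \right| \lesssim_P \sqrt{\frac{\log(p \vee n)}{n}} \to_P 0.$$

Part 2. To show (1), we note that by simple union bounds and tail properties of Gaussian
variable, we have that $\max_{i,j} f_{ij}^2 \lesssim_P \log(p \vee n)$, so we need $\log(p \vee n) \frac{s \log(p \vee n)}{n} \to 0$.

Therefore $\max_{1 \le j \le p} E_n[f_{ij}^4 v_{il}^4] \le E_n[v_{il}^4] \max_{1 \le i \le n, 1 \le j \le p} f_{ij}^4 \lesssim_P \log^2(p \vee n)$. Thus, applying
Lemma 4,

$$\max_{1 \le l \le k_e, 1 \le j \le p} \left| E_n[f_{ij}^2 v_{il}^2] - E[f_{ij}^2 v_{il}^2] \right| \lesssim_P \sqrt{\frac{\log(p \vee n)}{n}} \sqrt{\log^2(p \vee n)} = \sqrt{\frac{\log^3(p \vee n)}{n}} \to_P 0,$$

since $\log p = o(n^{1/3})$. The remaining moment conditions of RF(ii) follow immediately from
the definition of the conditionally bounded moments since for any $m > 0$, $E[|f_{ij}|^m]$ is bounded,
uniformly in $1 \le j \le p$, uniformly in $n$, for the i.i.d. Gaussian regressors of Lemma 1 of Belloni,
Chen, Chernozhukov, and Hansen (2010). The proof of (2) for arbitrary bounded i.i.d. regressors
of Lemma 2 of Belloni, Chen, Chernozhukov, and Hansen (2010) is similar.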



Appendix E. Proof of Lemma 4.

The first two conditions of SM(iii) follow from the assumed rate $s^2 \log^2(p \vee n) = o(n)$
since we have $q_\epsilon = 4$. To show part (1), we note that by simple union bounds and tail
properties of Gaussian variable, we have that $\max_{1 \le i \le n, 1 \le j \le p} f_{ij}^2 \lesssim_P \log(p \vee n)$, so we need
$\log(p \vee n) \frac{s \log(p \vee n)}{n} \to 0$. Applying union bound, Gaussian concentration inequalities (Ledoux and
Talagrand (1991)), and that $\log^2 p = o(n)$, we have $\max_{1 \le j \le p} E_n[f_{ij}^4] \lesssim_P 1$. Thus SM(iii)(c) holds
by $\max_{1 \le j \le p} E_n[f_{ij}^2 \epsilon_i^2] \le (E_n[\epsilon_i^4])^{1/2} \max_{1 \le j \le p} (E_n[f_{ij}^4])^{1/2} \lesssim_P 1$. Part (2) follows because regressors
are bounded and the moment assumption on $\epsilon$. □
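
Appendix F. Proof of Lemma 11.

Step 1. Here we consider the initial option, in which $\hat\gamma_{lj}^2 = E_n[f_{ij}^2 (d_{il} - E_n d_{il})^2]$. Let us define
$\tilde d_{il} = d_{il} - E[d_{il}]$, $\tilde\gamma_{lj}^2 = E_n[f_{ij}^2 \tilde d_{il}^2]$ and $\gamma_{lj}^2 = E[f_{ij}^2 \tilde d_{il}^2]$. We want to show that

$$\Delta_1 = \max_{1 \le l \le k_e, 1 \le j \le p} |\hat\gamma_{lj}^2 - \tilde\gamma_{lj}^2| \to_P 0, \quad \Delta_2 = \max_{1 \le l \le k_e, 1 \le j \le p} |\tilde\gamma_{lj}^2 - \gamma_{lj}^2| \to_P 0,$$

which would imply that $\max_{1 \le l \le k_e, 1 \le j \le p} |\hat\gamma_{lj}^2 - \gamma_{lj}^2| \to_P 0$ and then since $\gamma_{lj}^2$'s are uniformly
bounded from above by Condition RF and bounded below by $\gamma_{lj}^{0\,2} = E[f_{ij}^2 v_{il}^2]$, which are bounded
away from zero. The asymptotic validity of the initial option then follows.

We have that $\Delta_2 \to_P 0$ by Condition RF, and, since $E_n[\tilde d_{il}] = E_n[d_{il}] - E[d_{il}]$, we have

$$\Delta_1 = \max_{1 \le l \le k_e, 1 \le j \le p} \left| E_n[f_{ij}^2 \{(\tilde d_{il} - E_n \tilde d_{il})^2 - \tilde d_{il}^2\}] \right|$$
$$\le \max_{1 \le l \le k_e, 1 \le j \le p} 2 \left| E_n[f_{ij}^2 \tilde d_{il}] E_n[\tilde d_{il}] \right| + \max_{1 \le l \le k_e, 1 \le j \le p} \left| E_n[f_{ij}^2] (E_n \tilde d_{il})^2 \right| \to_P 0.$$

Indeed, we have for the first term that,

$$\max_{l, j} |E_n[f_{ij}^2 \tilde d_{il}]| \, |E_n[\tilde d_{il}]| \le \max_{i, j} |f_{ij}| \sqrt{E_n[f_{ij}^2 \tilde d_{il}^2]} \; O_P(1/\sqrt{n}) \to_P 0$$

where we first used Holder inequality and $\max_{1 \le l \le k_e} |E_n[\tilde d_{il}]| \lesssim_P \sqrt{k_e}/\sqrt{n}$ by the Chebyshev
inequality and by $E[\tilde d_{il}^2]$ being uniformly bounded by Condition RF; then we invoked Condition
RF to claim convergence to zero. Likewise, by Condition RF,

$$\max_{l, j} |E_n[f_{ij}^2] (E_n \tilde d_{il})^2| \le \max_{i, j} |f_{ij}^2| \, O_P(1/n) \to_P 0.$$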

Step 2. Here we consider the refined option, in which $\hat\gamma_{lj}^2 = E_n[f_{ij}^2 \hat v_{il}^2]$. The residual here $\hat v_{il} = d_{il} - \hat D_{il}$
can be based on any estimator that obeys

$$\max_{1 \le l \le k_e} \|\hat D_{il} - D_{il}\|_{2,n} \lesssim_P \sqrt{\frac{s \log(p \vee n)}{n}}. \quad (F.46)$$

Such estimators include the Lasso and Post-Lasso estimators based on the initial option. Below
we establish that the penalty levels, based on the refined option using any estimator obeying
(F.46), are asymptotically valid. Thus by Theorems 1 and 2, the Lasso and Post-Lasso estimators
based on the refined option also obey (F.46). This establishes that we can iterate on the refined
option a bounded number of times, without affecting the validity of the approach.

Recall that $\hat\gamma_{lj}^{0\,2} = E_n[f_{ij}^2 v_{il}^2]$ and define $\gamma_{lj}^{0\,2} := E[f_{ij}^2 v_{il}^2]$, which is bounded away from zero and
from above by assumption. Hence, by Condition RF, it suffices to show that $\max_{1 \le l \le k_e, 1 \le j \le p} |\hat\gamma_{lj}^2 - \gamma_{lj}^{0\,2}| \to_P 0$,
which implies the loadings are asymptotically valid with $u' = 1$. This in turn follows
from

$$\Delta_1 = \max_{l, j} |\hat\gamma_{lj}^2 - \hat\gamma_{lj}^{0\,2}| \to_P 0, \quad \Delta_2 = \max_{l, j} |\hat\gamma_{lj}^{0\,2} - \gamma_{lj}^{0\,2}| \to_P 0,$$

which we establish below.

Now note that we have proven $\Delta_2 \to_P 0$ in Step 3 of the proof of Theorem 1. As for $\Delta_1$
we note that

$$\Delta_1 \le 2 \max_{l, j} |E_n[f_{ij}^2 v_{il} (\hat D_{il} - D_{il})]| + \max_{l, j} E_n[f_{ij}^2 (\hat D_{il} - D_{il})^2].$$

The first term is bounded by Holder and Lyapunov inequalities,

$$\max_{i, j} |f_{ij}| (E_n[f_{ij}^2 v_{il}^2])^{1/2} \max_{l} \|\hat D_{il} - D_{il}\|_{2,n} \lesssim_P \max_{i, j} |f_{ij}| (E_n[f_{ij}^2 v_{il}^2])^{1/2} \sqrt{\frac{s \log(p \vee n)}{n}} \to_P 0,$$

where the conclusion is by Condition RF. The second term is of stochastic order

$$\max_{i, j} f_{ij}^2 \, \frac{s \log(p \vee n)}{n},$$

which converges to zero by Condition RF. □
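
The two loading choices analysed in Steps 1 and 2 can be sketched in a few lines for a single first-stage equation. This is an illustrative sketch, not the authors' implementation: the names `initial_loadings` and `refined_loadings`, the row-major list layout of `F`, and the argument `d_fitted` (standing in for $\hat D_i$ from a preliminary Lasso or Post-Lasso fit) are assumptions.

```python
from math import sqrt

def initial_loadings(F, d):
    """gamma_j = sqrt(E_n[f_ij^2 * (d_i - mean(d))^2]) for each column j of F."""
    n = len(d)
    d_bar = sum(d) / n
    return [sqrt(sum(row[j] ** 2 * (di - d_bar) ** 2 for row, di in zip(F, d)) / n)
            for j in range(len(F[0]))]

def refined_loadings(F, d, d_fitted):
    """gamma_j = sqrt(E_n[f_ij^2 * vhat_i^2]) with residuals vhat_i = d_i - Dhat_i."""
    n = len(d)
    resid = [di - fi for di, fi in zip(d, d_fitted)]
    return [sqrt(sum(row[j] ** 2 * v ** 2 for row, v in zip(F, resid)) / n)
            for j in range(len(F[0]))]

F = [[1.0, 2.0], [1.0, 0.0], [1.0, -2.0]]  # n = 3 observations, p = 2 instruments
d = [1.0, 2.0, 3.0]
print(initial_loadings(F, d))
print(refined_loadings(F, d, [1.1, 2.0, 2.9]))
```

Iterating the refined option amounts to refitting with `refined_loadings`, recomputing `d_fitted`, and repeating a bounded number of times, as the proof permits.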
Appendix G. Additional Simulation Results 

In this appendix, we present simulation results to complement the results given in the paper. 
The simulations use the same model as the simulations in the paper: 

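\[ y_i = \beta x_i + e_i, \]
\[ x_i = z_i' \Pi + v_i, \]
\[ (e_i, v_i) \sim N\left(0, \begin{pmatrix} \sigma_e^2 & \sigma_{ev} \\ \sigma_{ev} & \sigma_v^2 \end{pmatrix}\right), \]
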
where $\beta = 1$ is the parameter of interest, and $z_i = (z_{i1}, \ldots, z_{i100})' \sim N(0, \Sigma_Z)$ is a
$100 \times 1$ vector with $E[z_{ih}^2] = \sigma_z^2$ and $\mathrm{Corr}(z_{ih}, z_{ij}) = .5^{|j-h|}$. In all simulations, we set
$\sigma_e^2 = 2$ and $\sigma_z^2 = 0.3$.
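
As an illustration of the instrument design, the sketch below builds a covariance matrix with entries $\sigma_z^2 \, .5^{|j-h|}$ and draws Gaussian instruments with that covariance. It is only a sketch: the function names and the seed are hypothetical, and the draw uses a stationary AR(1) recursion across the index $h$, which is one convenient way (assumed here, not taken from the paper) to obtain this correlation pattern.

```python
import math
import random

def instrument_cov(p, sigma2_z, rho=0.5):
    """Covariance matrix with entries sigma2_z * rho**|j - h|."""
    return [[sigma2_z * rho ** abs(j - h) for j in range(p)] for h in range(p)]

def draw_instruments(n, p, sigma2_z, rho=0.5, seed=0):
    """Draw n rows z_i ~ N(0, Sigma_Z) via a stationary AR(1) recursion across h."""
    rng = random.Random(seed)
    sd = math.sqrt(sigma2_z)
    innov = math.sqrt(1.0 - rho ** 2)  # keeps Var(z_ih) equal to sigma2_z for every h
    rows = []
    for _ in range(n):
        z = [sd * rng.gauss(0.0, 1.0)]
        for h in range(1, p):
            z.append(rho * z[-1] + innov * sd * rng.gauss(0.0, 1.0))
        rows.append(z)
    return rows

Sigma_Z = instrument_cov(p=100, sigma2_z=0.3)
Z = draw_instruments(n=100, p=100, sigma2_z=0.3)
```

Because each $z_{ih}$ is a linear combination of independent standard normals, the rows are jointly Gaussian with the stated variances and correlations.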

For the other parameters, we consider various settings. We provide results for sample sizes, 
n, of 100, 250, and 500; and we consider three different values for Corr(e,v): 0, .3, and .6. 
We also consider four values of $\sigma_v^2$ which are chosen to benchmark four different strengths of
instruments. The four values of $\sigma_v^2$ are found as $\sigma_v^2 = \frac{n \Pi' \Sigma_Z \Pi}{F^* \Pi' \Pi}$ for $F^*$: 2.5, 10, 40, and 160. We
use two different designs for the first-stage coefficients, $\Pi$. The first sets the first $S$ elements of $\Pi$
equal to one and the remaining elements equal to zero. We refer to this design as the "cut-off"
design. The second sets the coefficient on $z_{ih}$ equal to $.7^{h-1}$ for $h = 1, \ldots, 100$. We refer to this
design as the "exponential" design. In the cut-off case, we consider $S$ of 5, 25, 50, and 100 to
cover different degrees of sparsity.
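
The two first-stage coefficient designs can be written down directly. The sketch below is illustrative only; the function names and the dictionary of designs are hypothetical conveniences, not part of the original simulation code.

```python
def cutoff_design(S, p=100):
    """Cut-off design: first S coefficients equal to one, remaining p - S equal to zero."""
    return [1.0 if h < S else 0.0 for h in range(p)]

def exponential_design(p=100, base=0.7):
    """Exponential design: coefficient on z_ih equal to base**(h - 1) for h = 1, ..., p."""
    return [base ** (h - 1) for h in range(1, p + 1)]

# The five coefficient vectors considered: four cut-off cases and one exponential case.
designs = {f"cut-off, S={S}": cutoff_design(S) for S in (5, 25, 50, 100)}
designs["exponential"] = exponential_design()
```

With $S = 100$ the cut-off design is fully dense, while the exponential design is not exactly sparse but has rapidly decaying coefficients, which is the approximately sparse case.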

For each setting of the simulation parameter values, we report results from seven different
estimation procedures. A simple possibility when presented with many instrumental variables
is to just estimate the model using 2SLS and all of the available instruments. It is well-known
that this will result in poor finite-sample properties unless there are many more observations
than instruments; see, for example, Bekker (1994). The limited information maximum likelihood
estimator (LIML) and its modification by Fuller (1977) (FULL)$^{22}$ are both robust to many
instruments as long as the presence of many instruments is accounted for when constructing
standard errors for the estimators; see, for example, Bekker (1994) and Hansen, Hausman, and
Newey (2008). We report results for these estimators in rows labeled 2SLS(100), LIML(100), and
FULL(100), respectively.$^{23}$ For LASSO, we consider variable selection based on two different
sets of instruments. In the first scenario, we use LASSO to select among the base 100 instruments
and report results for the IV estimator based on the LASSO (LASSO) and Post-LASSO
(Post-LASSO) forecasts. In the second, we use LASSO to select among 120 instruments formed
by augmenting the base 100 instruments by the first 20 principal components constructed from
the sampled instruments in each replication. We then report results for the IV estimator based
on the LASSO (LASSO-F) and Post-LASSO (Post-LASSO-F) forecasts. In all cases, we use
the refined data-dependent penalty loadings given in the paper.$^{24}$ For each estimator, we report
root-truncated-mean-squared-error (RMSE),$^{25}$ median bias (Med. Bias), median absolute
deviation (MAD), and rejection frequencies for 5% level tests (rp(.05)).$^{26}$ For computing
rejection frequencies, we estimate conventional 2SLS standard errors for 2SLS(100), LASSO, and
Post-LASSO, and the many-instrument-robust standard errors of Hansen, Hausman, and Newey
(2008) for LIML(100) and FULL(100).

22 Fuller (1977) requires a user-specified parameter. We set this parameter equal to one, which
produces a higher-order unbiased estimator.
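
The reported summary statistics can be sketched as follows for a vector of estimates and standard errors collected across replications. This is a hedged illustration: the function name and toy numbers are hypothetical, MAD is assumed here to be the median of $|\hat\beta - \beta|$ around the true value, the 5% test is assumed to use the two-sided normal critical value 1.96, and the truncated RMSE is omitted because its truncation rule is not restated in this passage.

```python
import statistics

def summarize(estimates, std_errors, beta0=1.0, crit=1.96):
    """Median bias, median absolute deviation, and 5% rejection frequency."""
    errors = [b - beta0 for b in estimates]
    med_bias = statistics.median(errors)
    # Assumption: MAD is measured around the true value beta0, not around the median.
    mad = statistics.median(abs(e) for e in errors)
    # Reject H0: beta = beta0 when the absolute t-statistic exceeds the critical value.
    rejections = sum(abs(e) / se > crit for e, se in zip(errors, std_errors))
    return {"med_bias": med_bias, "mad": mad, "rp05": rejections / len(estimates)}

# Toy replications (hypothetical numbers, for illustration only).
result = summarize([0.8, 1.0, 1.1, 1.5], [0.1, 0.1, 0.1, 0.2])
```

A well-sized procedure would show `rp05` close to 0.05 across replications when the null value is the true $\beta = 1$.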